Comparing audio recordings - linux

I have 5 recorded wav files. I want to compare the new incoming recordings with these files and determine which one it resembles most.
In the final product I need to implement it in C++ on Linux, but now I am experimenting in Matlab. I can see FFT plots very easily. But I don't know how to compare them.
How can I compute the similarity of two FFT plots?
Edit: There is only speech in the recordings. Actually, I am trying to identify the response of answering machines of a few telecom companies. It's enough to distinguish two messages "this person can not be reached at the moment" and "this number is not used anymore"

This depends a lot on your definition of "resembles most", which can mean many things depending on your use case. If you just want to compare the bare spectra of the whole files, you can simply correlate the values returned by the two FFTs.
However, spectra tend to change a lot when the files get warped in time. To account for this, you need to do a windowed FFT and compare the spectra for each window. This defines a difference function that you can use in a dynamic time warping algorithm.
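As a rough illustration of the dynamic time warping idea, here is a minimal Python sketch. It assumes each recording has already been turned into a list of per-window feature vectors (spectra or MFCCs); the function names and the Euclidean frame distance are illustrative choices, not something prescribed above.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two per-window feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b, dist=frame_distance):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of a with the first j frames of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def closest_reference(incoming, references):
    """Return the key of the reference sequence with the lowest warping cost."""
    return min(references, key=lambda name: dtw_distance(incoming, references[name]))
```

With five stored messages, `references` would be a dict of five frame sequences, and `closest_reference` picks the one the incoming frames warp onto most cheaply.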
If you need perceptual resemblance, an FFT probably does not get you what you need. MFCCs of the recordings are most likely much closer to this problem. Again, you might need to calculate windowed MFCCs instead of MFCCs of the whole recording.
If you have musical recordings, you again need completely different approaches. There is a blog post that describes how Shazam works, which you should be able to find on Google. Or, if you want real musical similarity, have a look at this book
EDIT:
The best solution for the problem specified above would be the one described here (the "Shazam algorithm" mentioned above). However, this is a bit complicated to implement, and an easier solution might do well enough.

If you know that there are only 5 different possible incoming files, I would suggest first trying something as easy as the Euclidean distance between the two signals (in the time or Fourier domain). It is likely to give you good results.
Edit: So with different possible starts, try doing a cross-correlation of the incoming recording with each file and see which file gives the highest peak.
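One way to cope with different start offsets is to slide a stored snippet over the incoming samples and keep the best normalized correlation. The following Python sketch shows the idea under the assumption that both signals are plain lists of samples at the same sample rate; `best_lag_correlation` is a hypothetical helper name, and the brute-force loop is O(n*m), so a real implementation would likely use an FFT-based correlation instead.

```python
import math

def best_lag_correlation(signal, template):
    """Slide `template` over `signal`; return (peak normalized correlation, lag in samples)."""
    n, m = len(signal), len(template)
    if m > n:
        raise ValueError("template must not be longer than signal")
    t_norm = math.sqrt(sum(t * t for t in template))
    best_score, best_lag = -1.0, 0
    for lag in range(n - m + 1):
        window = signal[lag:lag + m]
        w_norm = math.sqrt(sum(w * w for w in window))
        if w_norm == 0.0 or t_norm == 0.0:
            continue  # silent window: correlation is undefined, skip it
        score = sum(w * t for w, t in zip(window, template)) / (w_norm * t_norm)
        if score > best_score:
            best_score, best_lag = score, lag
    return best_score, best_lag
```

Running this once per stored message and picking the highest peak gives the "which file has the higher peak" comparison.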

I suggest you compute a simple sound parameter such as the fundamental frequency. There are several methods of getting this value; I tried autocorrelation and cepstrum, and for voice signals they worked fine. With such a function working, you can do a time analysis and compare the two signals (base, the one you compare against, and in, the one you would like to match) by their frequency over given intervals. Comparing several intervals on such criteria can tell you which base sample matches best.
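A minimal sketch of the autocorrelation approach, in Python for illustration: it picks the lag with the strongest autocorrelation inside a plausible pitch range. The 60-400 Hz search range and the name `estimate_f0` are assumptions for this example, not values taken from the answer.

```python
import math

def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency of one frame from its autocorrelation peak.

    Returns the frequency in Hz, or None if no positive peak is found.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag
        value = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None
```

Calling this on successive frames of the base and incoming recordings gives two pitch tracks that can then be compared interval by interval.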
Of course, everything depends on what you mean by "resembles most". To compare, you can introduce other parameters like volume, noise, clicks, pitches...

Related

Determining the 'amount' of speaking in a video

I'm working on a project to transcribe lecture videos. We are currently just using humans to do the transcriptions as we believe it is easier to transcribe than editing ASR, especially for technical subjects (not the point of my question, though I'd love any input on this). From our experiences we've found that after about 10 minutes of transcribing we get anxious or lose focus. Thus we have been splitting videos into ~5-7 minute chunks based on logical breaks in the lecture content. However, we've found that the start of a lecture (at least for the class we are piloting) often has more talking than later on, which often has time where the students are talking among themselves about a question. I was thinking that we could do signal processing to determine the rough amount of speaking throughout the video. The idea is to break the video into segments containing roughly the same amount of lecturing, as opposed to segments that are the same length.
I've done a little research into this, but everything seems to be a bit overkill for what I'm trying to do. The videos for this course, though we'd like to generalize, contain basically just the lecturer with some occasional feedback and distant student voices. So can I just simply look at the waveform and roughly use the spots containing audio over some threshold to determine when the lecturer is speaking? Or is an ML approach really necessary to quantify the lecturer's speaking?
Hope that made sense, I can clarify anything if necessary.
Appreciate the help as I have no experience with signal processing.
Although there are machine learning methods that are very good at discriminating voice from other sounds, you don't seem to require that sort of accuracy for your application. A simple level-based method similar to the one you proposed should be good enough to get you an estimate of speaking time.
Level-Based Sound Detection
Goal
Given an audio sample, discriminate the portions with a high amount of sounds from the portions that consist of background noise. This can then be easily used to estimate the amount of speech in a sound file.
Overview of Method
Rather than looking at raw levels in the signal, we will first convert it to a sliding-window RMS. This gives a simple measure of how much audio energy is at any given point of the audio sample. By analyzing the RMS signal we can automatically determine a threshold for distinguishing between background noise and speech.
Worked Example
I will be working this example in MATLAB because it makes the math easy to do and lets me create illustrations.
Source Audio
I am using President Kennedy's "We choose to go to the moon" speech. I'm using the audio file from Wikipedia, and just extracting the left channel.
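% Load the recording: .data holds the samples, .fs the sample rate
imported = importdata('moon.ogg');
audio = imported.data(:,1);   % keep only the left channel
% Plot the waveform against time in seconds
plot((1:length(audio))/imported.fs, audio);
title('Raw Audio Signal');
xlabel('Time (s)');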
Generating RMS Signal
Although you could technically implement an overlapping per-sample sliding window, it is simpler to avoid the overlapping and you'll get very similar results. I broke the signal into one-second chunks, and stored the RMS values in a new array with one entry per second of audio.
audioRMS = [];
% Step through the signal in non-overlapping one-second (fs-sample) chunks
for i = 1:imported.fs:(length(audio)-imported.fs)
audioRMS = [audioRMS; rms(audio(i:(i+imported.fs-1)))];
end
plot(1:length(audioRMS), audioRMS);
title('Audio RMS Signal');
xlabel('Time (s)');
This results in a much smaller array, full of positive values representing the amount of audio energy or "loudness" per second.
Picking a Threshold
The next step is to determine how "loud" is "loud enough." You can get an idea of the distribution of noise levels with a histogram:
histogram(audioRMS, 50);
I suspect that the lower shelf is the general background noise of the crowd and recording environment. The next shelf is probably the quieter applause. The rest is speech and loud crowd reactions, which will be indistinguishable to this method. For your application, the loudest areas will almost always be speech.
The minimum value in my RMS signal is .0233, and as a rough guess I'm going to use 3 times that value as my criterion for noise. That seems like it will cut off the whole lower shelf and most of the next one.
A simple check against that threshold gives a count of 972 seconds of speech:
>> sum(audioRMS > 3*min(audioRMS))
ans =
972
To test how well it actually worked, we can listen to the audio that was eliminated.
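speech = audioRMS > 3*min(audioRMS);   % true for each one-second window above the threshold
clippedAudio = [];
for i = 1:length(speech)
if(~speech(i))
clippedAudio = [clippedAudio; audio(((i-1)*imported.fs+1):i*imported.fs)];
end
end
>> sound(clippedAudio, imported.fs);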
Listening to this gives a bit over a minute of background crowd noise and sub-second clips of portions of words, due to the one-second windows used in the analysis. No significant lengths of speech are clipped. Doing the opposite gives audio that is mostly the speech, with clicks heard as portions are skipped. The louder applause breaks also make it through.
This means that for this speech, the threshold of three times the minimum RMS worked very well. You'll probably need to fiddle with that ratio to get good automatic results for your recording environment, but it seems like a good place to start.
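To get from the per-second speech flags back to the original goal (chunks containing roughly the same amount of lecturing), one possible follow-up is sketched below in Python. It assumes a list with one 0/1 entry per second, like the thresholded RMS signal above; `split_by_speech` is a hypothetical helper, and it ignores the "logical breaks" a human would still want to respect.

```python
def split_by_speech(speech_flags, n_chunks):
    """Return cut points (in seconds) so each chunk holds roughly equal speech seconds.

    speech_flags: one 0/1 (or False/True) entry per second of audio.
    """
    total = sum(speech_flags)
    cuts = []
    running = 0
    for second, is_speech in enumerate(speech_flags, start=1):
        running += is_speech
        # Cut once the running speech total reaches the next equal share of the total
        if len(cuts) < n_chunks - 1 and running >= total * (len(cuts) + 1) / n_chunks:
            cuts.append(second)
    return cuts
```

The returned cut points could then be nudged by hand to the nearest natural pause in the lecture.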

How to predict when next event occurs based on previous events? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 11 years ago.
Basically, I have a reasonably large list (a year's worth of data) of times that a single discrete event occurred (for my current project, a list of times that someone printed something). Based on this list, I would like to construct a statistical model of some sort that will predict the most likely time for the next event (the next print job) given all of the previous event times.
I've already read this, but the responses don't exactly help out with what I have in mind for my project. I did some additional research and found that a Hidden Markov Model would likely allow me to do so accurately, but I can't find a link on how to generate a Hidden Markov Model using just a list of times. I also found that using a Kalman filter on the list may be useful but basically, I'd like to get some more information about it from someone who's actually used them and knows their limitations and requirements before just trying something and hoping it works.
Thanks a bunch!
EDIT: So by Amit's suggestion in the comments, I also posted this to the Statistics StackExchange, CrossValidated. If you do know what I should do, please post either here or there
I'll admit it, I'm not a statistics kind of guy. But I've run into this kind of problem before. Really what we're talking about here is that you have some observed, discrete events and you want to figure out how likely it is you'll see them occur at any given point in time. The issue you've got is that you want to take discrete data and make continuous data out of it.
The term that comes to mind is density estimation, specifically kernel density estimation. You can get some of the effects of kernel density estimation by simple binning (e.g. count the number of events in a time interval such as every quarter hour or hour). Kernel density estimation just has some nicer statistical properties than simple binning. (The produced data is often 'smoother'.)
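To make the contrast between binning and kernel density estimation concrete, here is a small Python sketch. It assumes each event has been reduced to its time of day in hours (0 to 24); the Gaussian kernel, the 0.5-hour bandwidth and the wrap-around at midnight are illustrative choices, and the wrapped kernel is only approximately normalized.

```python
import math

def binned_counts(event_hours, bin_width=1.0):
    """Simple binning: count events per time-of-day bin."""
    n_bins = int(round(24 / bin_width))
    counts = [0] * n_bins
    for h in event_hours:
        counts[int(h / bin_width) % n_bins] += 1
    return counts

def kde(event_hours, at_hour, bandwidth=0.5):
    """Gaussian kernel density estimate at `at_hour`, using circular time-of-day distance."""
    total = 0.0
    for h in event_hours:
        diff = abs(at_hour - h) % 24
        diff = min(diff, 24 - diff)  # shortest way around the clock
        total += math.exp(-0.5 * (diff / bandwidth) ** 2)
    return total / (len(event_hours) * bandwidth * math.sqrt(2 * math.pi))
```

Evaluating `kde` on a fine grid of hours gives the smooth curve; `binned_counts` gives the corresponding histogram.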
That only takes care of one of your problems, though. The next problem is still the far more interesting one -- how do you take a time line of data (in this case, only printer data) and produce a prediction from it? First things first -- the way you've set up the problem may not be what you're looking for. While the miracle idea of having a limited source of data and predicting the next step of that source sounds attractive, it's far more practical to integrate more data sources to create an actual prediction. (e.g. maybe the printers get hit hard just after there's a lot of phone activity -- something that can be very hard to predict in some companies.) The Netflix Challenge is a rather potent example of this point.
Of course, the problem with more data sources is that there's extra legwork to set up the systems that collect the data then.
Honestly, I'd consider this a domain-specific problem and take two approaches: Find time-independent patterns, and find time-dependent patterns.
An example time-dependent pattern would be that every week day at 4:30 Suzy prints out her end of the day report. This happens at specific times every day of the week. This kind of thing is easy to detect with fixed intervals. (Every day, every week day, every weekend day, every Tuesday, every 1st of the month, etc...) This is extremely simple to detect with predetermined intervals -- just create a curve of the estimated probability density function that's one week long and go back in time and average the curves (possibly a weighted average via a windowing function for better predictions).
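A rough Python sketch of the "average the weekly curves" idea follows. It uses simple hour-of-week counts instead of a smooth density, weights older weeks less with an exponential decay, and measures a week's age backwards from the most recent event; all three are assumptions made for the example, as is the name `weekly_profile`.

```python
from datetime import datetime

def weekly_profile(timestamps, decay=0.9):
    """Weighted average number of events per hour-of-week slot (168 slots).

    Weeks are counted back from the most recent event; a week that is
    `age` weeks old gets weight decay ** age.
    """
    latest = max(timestamps)
    profile = [0.0] * 168
    max_age = 0
    for ts in timestamps:
        age = (latest - ts).days // 7
        max_age = max(max_age, age)
        profile[ts.weekday() * 24 + ts.hour] += decay ** age
    # Normalize by the total weight of all weeks covered, including empty ones
    norm = sum(decay ** w for w in range(max_age + 1))
    return [p / norm for p in profile]
```

A slot value near 1.0 means "about one print job in this hour every week", which would be the signature of something like the daily 4:30 report.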
If you want to get more sophisticated, find a way to automate the detection of such intervals. (The data likely wouldn't be so overwhelming that you couldn't just brute-force this.)
An example time-independent pattern is that every time Mike in accounting prints out an invoice list sheet, he goes over to Johnathan who prints out a rather large batch of complete invoice reports a few hours later. This kind of thing is harder to detect because it's more free form. I recommend looking at various intervals of time (e.g. 30 seconds, 40 seconds, 50 seconds, 1 minute, 1.2 minutes, 1.5 minutes, 1.7 minutes, 2 minutes, 3 minutes, .... 1 hour, 2 hours, 3 hours, ....) and subsampling them in a nice way (e.g. Lanczos resampling) to create a vector. Then use a vector-quantization style algorithm to categorize the "interesting" patterns. You'll need to think carefully about how you'll deal with certainty of the categories, though -- if a resulting category has very little data in it, it probably isn't reliable. (Some vector quantization algorithms are better at this than others.)
Then, to create a prediction as to the likelihood of printing something in the future, look up the most recent activity intervals (30 seconds, 40 seconds, 50 seconds, 1 minute, and all the other intervals) via vector quantization and weight the outcomes based on their certainty to create a weighted average of predictions.
You'll want to find a good way to measure certainty of the time-dependent and time-independent outputs to create a final estimate.
This sort of thing is typical of predictive data compression schemes. I recommend you take a look at PAQ since it's got a lot of the concepts I've gone over here and can provide some very interesting insight. The source code is even available along with excellent documentation on the algorithms used.
You may want to take an entirely different approach from vector quantization and discretize the data and use something more like a PPM scheme. It can be very much simpler to implement and still effective.
I don't know what the time frame or scope of this project is, but this sort of thing can always be taken to the N-th degree. If it's got a deadline, I'd like to emphasize that you worry about getting something working first, and then make it work well. Something not optimal is better than nothing.
This kind of project is cool. This kind of project can get you a job if you wrap it up right. I'd recommend you take your time, do it right, and post it as functional, open source, useful software. I highly recommend open source since you'll want to build a community that can contribute data source providers in more environments than you have the access, will, or time to support.
Best of luck!
I really don't see how a Markov model would be useful here. Markov models are typically employed when the event you're predicting is dependent on previous events. The canonical example, of course, is text, where a good Markov model can do a surprisingly good job of guessing what the next character or word will be.
But is there a pattern to when a user might print the next thing? That is, do you see a regular pattern of time between jobs? If so, then a Markov model will work. If not, then the Markov model will be a random guess.
As for how to model it, think of the different time periods between jobs as letters in an alphabet. In fact, you could assign each time period a letter, something like:
A - 1 to 2 minutes
B - 2 to 5 minutes
C - 5 to 10 minutes
etc.
Then, go through the data and assign a letter to each time period between print jobs. When you're done, you have a text representation of your data, and that you can run through any of the Markov examples that do text prediction.
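The letter encoding above can be sketched in a few lines of Python. The bin edges follow the A/B/C example; the catch-all letter "D" for longer gaps, the treatment of gaps under one minute as "A", and the function names are assumptions for this illustration. The model is a plain first-order Markov chain over letters.

```python
from collections import Counter, defaultdict

# (upper edge in minutes, letter); gaps of 10 minutes or more fall through to "D"
EDGES = [(2, "A"), (5, "B"), (10, "C")]

def to_letters(event_minutes):
    """Turn a sorted list of event times (in minutes) into a string of gap letters."""
    letters = []
    for prev, cur in zip(event_minutes, event_minutes[1:]):
        gap = cur - prev
        letters.append(next((letter for edge, letter in EDGES if gap < edge), "D"))
    return "".join(letters)

def train(text):
    """Count how often each letter is followed by each other letter."""
    table = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        table[a][b] += 1
    return table

def predict_next(table, last_letter):
    """Most frequent follower of `last_letter`, or None if it was never seen."""
    followers = table.get(last_letter)
    return followers.most_common(1)[0][0] if followers else None
```

If the gaps really are random, the follower counts will be close to uniform and the prediction is, as noted above, little better than a guess.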
If you have an actual model that you think might be relevant for the problem domain, you should apply it. For example, it is likely that there are patterns related to day of week, time of day, and possibly date (holidays would presumably show lower usage).
Most raw statistical modelling techniques based on examining (say) time between adjacent events would have difficulty capturing these underlying influences.
I would build a statistical model for each of those known events (day of week, etc), and use that to predict future occurrences.
I think the predictive neural network would be a good approach for this task.
http://en.wikipedia.org/wiki/Predictive_analytics#Neural_networks
This method is also used for prediction in areas such as weather forecasting, stock markets, and sunspots.
There's a tutorial here if you want to know more about how it works.
http://www.obitko.com/tutorials/neural-network-prediction/
Think of a Markov chain as a graph whose vertices are connected to each other by edges with a weight or distance. Moving around this graph costs you the sum of the weights or distances you travel. Here is an example with text generation: http://phpir.com/text-generation.
A Kalman filter is used to track a state vector, generally with continuous (or at least discretized continuous) dynamics. This is sort of the polar opposite of sporadic, discrete events, so unless you have an underlying model that includes this kind of state vector (and is either linear or almost linear), you probably don't want a Kalman filter.
It sounds like you don't have an underlying model, and are fishing around for one: you've got a nail, and are going through the toolbox trying out files, screwdrivers, and tape measures 8^)
My best advice: first, use what you know about the problem to build the model; then figure out how to solve the problem, based on the model.

Simple audio filter-bank

I'm new to audio filters, so please excuse me if I'm saying something wrong.
I'd like to write code that can split audio stored as PCM samples into two or three frequency bands, do some manipulation (like modifying their levels) or analysis on them, and then reconstruct audio samples from the output.
As far as I have read on the internet, for this task I could use FFT-IFFT and do the manipulation on the complex form, or use a time-domain filter bank such as the one used by the MP2 audio encoding format. Maybe a filter bank is the better choice; at least I read somewhere that it can be more CPU-friendly in real-time streaming environments. However, I'm having a hard time understanding the mathematics behind a filter bank. I'm trying to find some source code (preferably in Java or C/C++) on this topic, so far with no success.
Can somebody provide tips or links that would get me closer to an example filter bank?
Using an FFT to split an audio signal into a few bands is overkill.
What you need is one or two Linkwitz-Riley filters. These filters split a signal into a high and low frequency part.
A nice property of this filter is that if you add the low- and high-frequency parts, you get almost the original signal back. There will be a little bit of phase shift, but the ear will not be able to hear it.
If you need more than two bands, you can chain the filters. For example, if you want to split the signal at 100 and 2000 Hz, it would look somewhat like this in pseudo-code:
low = linkwitz-riley-low (100, input-samples)
temp = linkwitz-riley-high (100, input-samples)
mids = linkwitz-riley-low (2000, temp)
highs = linkwitz-riley-high (2000, temp);
and so on..
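The pseudo-code above can be turned into a runnable sketch. This Python version builds each fourth-order Linkwitz-Riley section as two cascaded second-order Butterworth biquads, with coefficients from the widely used "Audio EQ Cookbook" formulas; the function names, the 44100 Hz default and the choice of fourth order are assumptions for illustration, and a C/C++ or Java port would follow the same structure.

```python
import math

def butterworth_biquad(kind, cutoff_hz, sample_rate):
    """Second-order Butterworth low/high-pass coefficients, normalized so a0 = 1."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2)  # sin(w0) / (2 * Q) with Q = 1 / sqrt(2)
    a0 = 1 + alpha
    if kind == "low":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    return [x / a0 for x in b], [-2 * cos_w0 / a0, (1 - alpha) / a0]

def run_biquad(coeffs, samples):
    """Direct form I: y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2]."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def linkwitz_riley(kind, cutoff_hz, samples, sample_rate=44100):
    """Fourth-order Linkwitz-Riley section: the same Butterworth biquad applied twice."""
    coeffs = butterworth_biquad(kind, cutoff_hz, sample_rate)
    return run_biquad(coeffs, run_biquad(coeffs, samples))

def split_three_bands(samples, sample_rate=44100, low_cut=100, high_cut=2000):
    """Chain two crossovers exactly as in the pseudo-code above."""
    lows = linkwitz_riley("low", low_cut, samples, sample_rate)
    rest = linkwitz_riley("high", low_cut, samples, sample_rate)
    mids = linkwitz_riley("low", high_cut, rest, sample_rate)
    highs = linkwitz_riley("high", high_cut, rest, sample_rate)
    return lows, mids, highs
```

Scaling `lows`, `mids` and `highs` by separate gains and adding them sample by sample gives the processed signal; with unity gains the sum is close to the input apart from phase shift.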
After splitting the signal you can, for example, amplify the three output bands (lows, mids and highs) and later add them together to get your processed signal.
The filter sections themselves can be implemented using IIR filters. A Google search for "Linkwitz-Riley digital IIR" should give lots of good hits.
http://en.wikipedia.org/wiki/Linkwitz-Riley_filter
You should look up wavelets, especially Daubechies wavelets. They will let you do the trick, they're FIR filters and they're really short.
Update
Downvoting with no explanation isn't cool. Additionally, I'm right. Wavelets are filter banks and their job is to do precisely what is described in the question. IMHO, that is. I've done it many times myself.
There's a lot of filter source code to be found here

How to distinguish chords from single notes?

I am a bit stuck here, as I can't seem to find any algorithms for distinguishing whether a sound produced is a chord or a single note. I am working with guitar.
Currently, what I am experimenting on is trying to get the Top 5 frequencies with the highest amplitudes, and then determining if they are harmonics of the fundamental (the one with the highest amplitude) or not. I am working on the theory that single notes contain more harmonics than chords, but I am unsure as to if this is the case.
Another thing I am considering is trying to add in the various amplitude values of the harmonics as well as comparing notes comprising the 'supposed chord' to the result from the FFT.
Can you help me out here? It would be really appreciated. Currently, I am only working on Major and Minor chords first.
Thank you very much!
Chord recognition is still a research topic. A good solution might require some fairly sophisticated AI pattern matching techniques. The International Society for Music Information Retrieval seems to run an annual contest on automatic transcription type problems. You can look up the conference and research papers on what has been tried, and how well it works.
Also note that the fundamental pitch is not necessarily the frequency with the highest FFT amplitude result. With a guitar, it very often is not.
You need to think about it in terms of the way we hear sound. Looking for the top 5 frequencies isn't going to do you any good.
You need to look for all frequencies with an amplitude of at least (max frequency amplitude)/sqrt(2) to determine the chord/not-chord aspect of the signal.
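The amplitude cut at max/sqrt(2) corresponds to the half-power (-3 dB) point. A minimal Python sketch of that selection step is shown below; it uses a naive O(n^2) DFT so it has no dependencies, and the function names are hypothetical. Whether the resulting list of bins reliably separates chords from single notes on real guitar recordings is not something this sketch establishes.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Magnitudes of DFT bins 0..n/2, computed naively (use an FFT for real data)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

def strong_bins(magnitudes):
    """Indices of all bins whose magnitude is at least max / sqrt(2)."""
    threshold = max(magnitudes) / math.sqrt(2)
    return [k for k, m in enumerate(magnitudes) if m >= threshold]
```

Bin `k` corresponds to a frequency of `k * sample_rate / n` Hz, so the returned indices can be mapped back to candidate pitches.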

What properties of sound can be represented / computed in code?

This one is probably for someone with some knowledge of music theory. Humans can identify certain characteristics of sounds such as pitch, frequency, etc. Based on these properties, we can compare one sound to another and get a measure of likeness. For instance, it is fairly easy to distinguish the sound of a piano from that of a guitar, even if both are playing the same note.
If we were to go about the same process programmatically, starting with two audio samples, what properties of the sounds could we compute and use for our comparison? On a more technical note, are there any popular APIs for doing this kind of stuff?
P.S.: Please excuse me if I've made any elementary mistakes in my question or I sound like a complete music noob. It's because I am a complete music noob.
There are two sets of properties.
The "Frequency Domain" -- the amplitudes of each overtone in a specific sample.
The "Time Domain" -- the sequence of amplitude samples through time.
You can, using Fourier Transforms, convert between the two.
The time domain is what sound "is" -- a sequence of amplitudes. The frequency domain is what we "hear" -- a set of overtones and pitches that determine instruments, harmonies, and dissonance.
A mixture of the two -- frequencies varying through time -- is the perception of melody.
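The conversion between the two domains can be shown with a naive discrete Fourier transform and its inverse. This Python sketch is O(n^2) and only meant to illustrate the round trip; real code would use an FFT library, and the function names are just for this example.

```python
import cmath
import math

def dft(samples):
    """Time domain -> frequency domain: one complex value per frequency bin."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]

def idft(spectrum):
    """Frequency domain -> time domain (real part, for spectra of real signals)."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in enumerate(spectrum)).real / n
        for i in range(n)
    ]
```

The magnitude `abs(spectrum[k])` is the strength of bin `k`, and `idft(dft(samples))` reproduces the original samples up to rounding error.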
The Echo Nest has easy-to-use analysis APIs to find out all you might want to know about a piece of music.
You might find the analyze documentation (warning, pdf link) helpful.
Any and all properties of sound can be represented / computed - you just need to know how. One of the more interesting is spectral analysis / spectrogramming (see http://en.wikipedia.org/wiki/Spectrogram).
Any properties you want can be measured or represented in code. What do you want?
Do you want to test if two samples came from the same instrument? That two samples of different instruments have the same pitch? That two samples have the same amplitude? The same decay? That two sounds have similar spectral centroids? That two samples are identical? That they're identical but maybe one has been reverberated or passed through a filter?
Ignore all the arbitrary human-created terms that you may be unfamiliar with, and consider a simpler description of reality.
Sound, like anything else that we perceive is simply a spatial-temporal pattern, in this case "of movement"... of atoms (air particles, piano strings, etc.). Movement of objects leads to movement of air that creates pressure waves in our ear, which we interpret as sound.
Computationally, this is easy to model; however, because this movement can be any pattern at all -- from a violent random shaking to a highly regular oscillation -- there often is no constant identifiable "frequency", because it's often not a perfectly regular oscillation. The shape of the moving object, waves reverberating through it, etc. all cause very complex patterns in the air... like the waves you'd see if you punched a pool of water.
The problem reduces to identifying common patterns and features of movement (at very high speeds). Because patterns are arbitrary, you really need a system that learns and classifies common patterns of movement (i.e. movement represented numerically in the computer) into various conceptual buckets of some sort.

Resources