How to compare spoken audio against reference recording - language learning - linux

I am looking for a way to compare a user-submitted audio recording against a reference recording in order to give someone a grade or percentage for language learning.
I realize that this is a very unscientific way of doing things and is more of a gimmick than anything.
My first thoughts are some sort of audio fingerprinting, or waveform comparison.
Any ideas where I should be looking?

This is by no means a trivial problem to solve, though there is an abundance of research on the topic. Presently the most successful forms of machine learning in the speech recognition domain apply Hidden Markov Model techniques.
You may also want to take a look at existing implementations of HMM algorithms. One such library in its early stages is ghmm.
Perhaps even better and more readily applicable to your problem is HTK.

In addition to chomp's great answer, one important keyword you probably need to look up is Dynamic Time Warping (DTW). This is the Wikipedia article: http://en.wikipedia.org/wiki/Dynamic_time_warping
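To make the DTW idea a bit more concrete, here is a minimal sketch in Python. It assumes each recording has already been reduced to a short sequence of numbers; in practice you would more likely compare per-frame feature vectors (MFCCs, for instance), which is my assumption and not something stated above. The name dtw_distance and the toy sequences are made up for illustration.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping cost between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment cost of a[:i] against b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical data: the same contour spoken at two speeds, plus an unrelated one.
reference = [0, 1, 2, 3, 2, 1, 0]
slow_take = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
other_take = [3, 3, 0, 0, 3, 3, 0]

same_shape_cost = dtw_distance(reference, slow_take)
different_cost = dtw_distance(reference, other_take)
```

A lower cost means a closer match after the timing differences have been warped away. How to map that cost to a grade or percentage is a separate design choice that you would have to tune on real recordings.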

Related

APCS final project: Converting an audio file to a simpler MIDI file

Let's say I have the audio file for Happy Birthday. I want to convert that audio file into an audio file that sounds like this: happy birthday.
First, I'd like to know if I have the ability to program this. Can a high schooler who's almost finished with APCS program this?
If I can:
How would I change the bpm of the song? I've searched through a bunch of websites, but they weren't very helpful.
I know that audio files can be represented in waveforms. How would I scan for each individual wave in an audio file (I need this to isolate the notes)?
This is a very ambitious project, actually. One reason is that it involves using digital signal processing tools like the FFT (fast Fourier transform) to analyze the sound and pick out the pitches. You might be able to find a library that can do this, but coding such a tool yourself would involve a steep learning curve.
If you would like to look further into this, there is a good online resource called "The Scientist and Engineer's Guide to Digital Signal Processing". I was able to work through and understand the discrete Fourier transform with only high school math (lots of trig) and a bit of calculus. It was a lift, though.
Trying to analyze rhythm is also no easy task. Even with the advanced tools provided in professional notation systems such as Finale, people have trouble playing rhythms in time well enough for the best transcription tools. Algorithms that "quantize" the beats help but also limit the amount of detail that can be included in the playback.
My guess is that as interesting and worthwhile as this project would be, to bring it to completion before the semester ends would require putting together prebuilt pieces. A lot of programming is done that way, these days.
If you scale the project back to something like just getting your code to analyze a short sample of a single note and give its pitch, that would be both impressive and doable with a lot of work. It could be done with a plain DFT algorithm instead of requiring an FFT, reducing the amount of info you'd have to acquire first. That way, you'd only have to work your way up to understanding and implementing the material at this link, which is about calculating the DFT. Notice that there is example code in BASIC. The code examples throughout this book are a big help.
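As a rough idea of what the scaled-back version might look like, here is a minimal sketch in Python rather than BASIC. It assumes you already have the note as a list of samples and know the sample rate; the function names and the synthetic 440 Hz test tone are my own inventions for illustration. It simply picks the strongest DFT bin, which works for a clean tone but may land on a harmonic for a real instrument.

```python
import math

def dft_magnitudes(samples):
    """Magnitudes of the first half of the DFT, computed directly (O(n^2), no FFT)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2):
        re = sum(samples[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        magnitudes.append(math.hypot(re, im))
    return magnitudes

def dominant_frequency(samples, sample_rate):
    """Frequency in Hz of the strongest non-DC bin."""
    magnitudes = dft_magnitudes(samples)
    peak_bin = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    # Each bin is sample_rate / n Hz wide.
    return peak_bin * sample_rate / len(samples)

# Hypothetical input: 0.1 s of a pure 440 Hz sine at 8000 samples per second.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]
detected_hz = dominant_frequency(tone, sample_rate)
```

With 800 samples at 8000 Hz the bins are 10 Hz apart, so the answer can only ever be a multiple of 10 Hz. A longer sample gives finer resolution at the cost of more computation, which is where the FFT eventually becomes worth learning.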

How does the Ableton Drum-To-MIDI function work?

I can't seem to find any information regarding the process that Ableton uses to efficiently detect atonal percussion and convert it into MIDI. I assume feature extraction and onset detection algorithms are executed, but I'm intrigued as to which algorithms. I am particularly interested in how its efficiency is maintained for a beatboxed input.
Cheers
Your guesses are as good as everyone else's - although they look plausible. The reality is that the way this feature is implemented in Ableton is a trade secret and likely to remain that way.
If I'm not mistaken Ableton licenses technology from https://www.zplane.de/ for these things.
I don't know exactly how the software assigns the different drum sounds, but the chapter in the Live manual, Convert Drums to New MIDI Track, says that it can only detect kick, snare and hi-hat. An important point is that they are identified by the transient markers. For a good result you should check and adjust them manually. The transient markers look like the warp markers, but are grey.
Compared to a kick and a snare, for example, a beatboxed input is likely to have less difference between the individual sounds and is therefore likely to be harder for Ableton to separate (depending on the beatboxer). In any case, some combination of frequency and amplitude, more specifically the envelope (attack, decay, sustain, release), as well as perhaps the different overtone combinations that account for differences in timbre, are going to be the characteristics that would have to be evaluated in order to separate the kick, snare and hi-hat.
Before this feature existed I used gates and hi/low pass filters to accomplish a similar task. So perhaps Ableton's solution is not as complicated as we might imagine.
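To be clear, nobody outside Ableton knows what their implementation does, so the following is not that. It is only a minimal sketch, in Python, of the gate idea mentioned above: treat the moment the short-term level rises above a threshold as a hit. The function name, the frame size, the threshold and the synthetic test signal are all assumptions of mine.

```python
import math

def detect_onsets(samples, sample_rate, frame_size=256, threshold=0.1):
    """Return times (in seconds) at which the frame RMS first rises above the threshold."""
    onsets = []
    gate_open = False
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(x * x for x in frame) / frame_size)
        loud = rms >= threshold
        if loud and not gate_open:
            # The gate has just opened: report the start of this frame as an onset.
            onsets.append(start / sample_rate)
        gate_open = loud
    return onsets

# Hypothetical signal: silence with two short 200 Hz bursts standing in for drum hits.
sample_rate = 8000

def burst(length):
    return [0.8 * math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(length)]

signal = [0.0] * 2048 + burst(512) + [0.0] * 2048 + burst(512) + [0.0] * 1024
onsets = detect_onsets(signal, sample_rate)
```

To tell a kick from a hi-hat you could run the same gate on a low-passed and a high-passed copy of the signal, which is roughly the manual gate-and-filter approach described above.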

Change pitch of audio buffer

I am trying to change the pitch of a buffer sample using a scriptprocessor, but what kind of formulas do I need to do this? I am not looking for the exact js code, but just for some general mathematical how to. I would love to have some code for this, as the first answer has a lot of formulas where I have no idea on how to implement that in JS.
I know that this is working with time, but according to this it can be done with the FFT, but I have no idea how one should do that.
For one method of doing time-pitch modification using an FFT, look up phase vocoder. Here's one explanation of how a phase vocoder works (but a search will turn up many others): http://www.guitarpitchshifter.com/algorithm.html
I believe https://github.com/mikolalysenko/pitch-shift would be appropriate (the quality is not on par with other code, but this library is rather easy to understand/use). You can hear a demo at http://mikolalysenko.github.io/pitch-shift/.
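If it helps with the general math: a shift of s semitones corresponds to a frequency ratio of 2^(s/12), and the crudest way to apply that ratio is to read through the buffer faster or slower. The sketch below (in Python rather than JS, with made-up function names) does only that plain resampling, so it also changes the duration. As far as I understand it, a phase vocoder gets around this by time-stretching first and then resampling like this; the sketch does not include that part.

```python
def semitones_to_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones (12 per octave)."""
    return 2.0 ** (semitones / 12.0)

def resample_linear(samples, ratio):
    """Read through `samples` at `ratio` times normal speed, with linear interpolation.

    ratio > 1 raises the pitch and shortens the buffer; ratio < 1 does the opposite.
    """
    out_len = int((len(samples) - 1) / ratio) + 1
    out = []
    for i in range(out_len):
        pos = i * ratio                      # fractional read position in the input
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

# Hypothetical buffer: a ramp, shifted up one octave (12 semitones, ratio 2).
octave_up = semitones_to_ratio(12)
shifted = resample_linear([0, 1, 2, 3, 4, 5, 6, 7], octave_up)
```

Reading at double speed keeps every second sample, so the buffer plays back in half the time and an octave higher.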

Implementing "best match" for sound effects

I am looking for some advice on categorizing a library of sound effects. I have a large set of random sound effects, (think whistles, pops, growls, creaks, gunshots etc). I would like to be able to take a growl for example, and find the next growl that sounds the closest to the original.
Given a sound, what sound from my set sounds the closest to it.
I have done a fair amount of googling and have found two avenues that I am still researching. One is using echonest, although their "best match" support does not look promising for public users. The other option is diving into FFT and building my own matching algorithm. This is a fine option and would be a great learning experience, but I wanted to get some opinions from others who might know a little more about sound processing, especially for short clips in the 0.5 to 3 second range rather than full-length music.
Thanks!
I have worked in movie postproduction for years and as far as I know, there is no way to do that automatically. Every file has meta information in its file header which describes what the sound is like. You are then actually not searching for the file names but in the meta string.
I don't think that it is trivial to sort effects programmatically as two effects that sound similar might be totally different if you look at the waveform.
You would need to extract significant information about a sound that you can then compare.
I am also not a DSP expert; maybe there are methods to do this.
If you're interested in trying to build your own system to do this, I can suggest a few keywords that might help to refine your Google searches. In the academic research community, the task you're describing is often called "content-based audio searching". I know there's been a lot of work done on it, and though most pertains to music, sound effects have definitely been the focus of a number of studies.
You might want to start with the work of Pedro Cano.
Also, I recently heard about a company that's doing similar work. You might want to check out products from Imagine Research.
Those are just a couple of ideas off the top of my head. I'm not 100% sure they'll be helpful. If they are, please let me know!
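For what it's worth, the general shape of a content-based search is: reduce every clip to a few numbers, then return the clip whose numbers are closest to the query's. Here is a minimal sketch in Python under that assumption. The two features (loudness and zero-crossing rate), the function names and the synthetic "library" are all placeholders I picked for illustration; a serious system would use richer spectral features and would normalise them.

```python
import math

def features(samples):
    """Two crude descriptors: RMS level and zero-crossing rate (a rough brightness cue)."""
    n = len(samples)
    rms = math.sqrt(sum(x * x for x in samples) / n)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (n - 1))

def closest_match(query, library):
    """Name of the library clip whose feature vector is nearest to the query's."""
    q = features(query)
    return min(library, key=lambda name: math.dist(q, features(library[name])))

def tone(freq_hz, sample_rate=8000, length=800):
    # Synthetic stand-in for a real sound effect; the phase offset avoids exact zeros.
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate + 0.3) for t in range(length)]

# Hypothetical library of two clips and a query that resembles the first.
library = {"low_hum": tone(100), "high_whistle": tone(2000)}
best = closest_match(tone(110), library)
```

With real sound effects you would compute the features once per file and store them, so that each search only has to compare small vectors rather than reprocess the audio.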

How to amplify certain audio samples, particularly amplifying a certain frequency?

Can anyone provide sample pseudocode or share some existing link that has sample code.
For example, I have mixed audio containing tones at 1 kHz, 2 kHz, 8 kHz and so on, and I want to boost only certain frequencies, such as 1 kHz, in real time.
Reading some DSP books and resources confuses me.
You just need to design and implement a suitable digital filter. This is a large and complex subject area though, so you won't get a simple answer here. Probably the best first step would be to read a good introductory book on DSP, e.g. Understanding DSP by Rick Lyons, which is very good for beginners as it's not too heavy on the math and has a more practical bent than most such introductory DSP books.
For this particular application though what you are trying to do is similar to implementing a graphic equalizer, and there are many pointers to how to implement this kind of thing if you use e.g. "graphic equalizer" as a search term.
There's a lot of math behind digital filtering. Sorry, I think it is important to at least understand basic filters (like those used in electronics). If you don't want to go through the basics: best to get an audio graphics equaliser where you can play with the (virtual) sliders. If you want to implement a very specific filter, please read on.
Real time: depends on your computing platform. If this is a small micro (like AVR, Microchip PIC, ...) you'll need an efficient algorithm. This is likely an IIR band pass filter. The equivalent of a graphics equaliser consists of multiple band pass filters, all summed together. See http://en.wikipedia.org/wiki/Infinite_impulse_response
A more computing intensive algorithm uses FIR filters. In that case you can also control the phase of the filtered signal. http://en.wikipedia.org/wiki/Finite_impulse_response
If you find an algorithm (e.g. IIR), you'll need to calculate the coefficients. Implementing the algorithm is simple; calculating the coefficients is not.
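To show how small the algorithm itself is once the coefficients are known, here is a minimal sketch in Python of a single second-order IIR section (a "biquad") set up as a peaking filter. The coefficient formulas are the peaking-EQ ones from Robert Bristow-Johnson's Audio EQ Cookbook, which is my choice of source and not something mentioned above; the function names, the Q of 1.0 and the 8000 Hz test setup are likewise assumptions for illustration.

```python
import math

def peaking_eq_coefficients(sample_rate, center_hz, gain_db, q=1.0):
    """Peaking-EQ biquad coefficients (Audio EQ Cookbook form), normalised so a[0] == 1."""
    amp = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha / amp
    b = [(1 + alpha * amp) / a0, -2 * math.cos(w0) / a0, (1 - alpha * amp) / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha / amp) / a0]
    return b, a

def biquad(samples, b, a):
    """Direct-form I: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def rms(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def measured_gain(freq_hz, sample_rate=8000, length=800):
    """Steady-state gain of a +12 dB boost at 1 kHz, measured on a test sine."""
    tone = [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(length)]
    b, a = peaking_eq_coefficients(sample_rate, 1000, 12.0)
    filtered = biquad(tone, b, a)
    half = length // 2                      # skip the start-up transient
    return rms(filtered[half:]) / rms(tone[half:])

gain_at_1k = measured_gain(1000)   # close to 10 ** (12 / 20), i.e. about 4x
gain_at_3k = measured_gain(3000)   # close to 1x: mostly left alone
```

The per-sample work is five multiplies and four adds, which is why this kind of filter suits small microcontrollers. A graphic equaliser would run several such sections, one per band.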
I found a book matching your question: Audio digital signal processing in real time
I browsed through it; it seems to have the right answers.

Resources