I am using the command line tool aubiopitch to analyze voice recordings. My goal is to determine the fundamental frequency of the voice recorded. I know, of course, that the frequency varies – that's why I want to calculate an "average" in Hz over a 30-second recording.
My question: aubio uses different methods to determine the pitch of a recording: Schmitt trigger, harmonic comb, yin, yinfft, etc. Which of these would be the best choice for pure human voice recordings (no background music, ambient sound, etc.)?
I would recommend using yinfast or yinfft (default). For a discussion of the algorithms, their parameters, and their performance, see Chapter 3 of this document.
Note that the median is better suited than the average in this case.
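To illustrate why the median is the safer summary, here is a minimal sketch. It assumes you have already parsed the per-frame pitch values from the tool's output into a list; the function name, the confidence values, and the 60-500 Hz plausibility range are my own assumptions, not part of aubio.

```python
import statistics

def summarize_pitch(pitches_hz, confidences=None, min_conf=0.8,
                    fmin=60.0, fmax=500.0):
    """Return (median, mean) in Hz over frames that look voiced.

    Assumptions (not from aubio): frames with a pitch outside
    [fmin, fmax] or a confidence below min_conf are treated as
    unvoiced and dropped.
    """
    if confidences is None:
        confidences = [1.0] * len(pitches_hz)
    voiced = [p for p, c in zip(pitches_hz, confidences)
              if c >= min_conf and fmin <= p <= fmax]
    if not voiced:
        raise ValueError("no voiced frames found")
    return statistics.median(voiced), statistics.fmean(voiced)

# Hypothetical pitch track: one unvoiced frame (0.0) and one octave error (240.0).
track = [118.0, 120.0, 0.0, 121.0, 240.0, 119.0, 122.0]
median_hz, mean_hz = summarize_pitch(track)
print(median_hz, mean_hz)  # 120.5 140.0
```

A single octave error pulls the mean up by about 20 Hz here, while the median stays close to the actual speaking pitch.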
CREPE is good and outperforms many others, since it uses a neural network for pitch prediction. It might be unstable in unseen conditions, though, and might not be very easy to plug in since it requires TensorFlow.
For a more traditional and lightweight solution you can try REAPER.
I just saw a paper by Cornell reconstructing faces from sound. But I am more interested in the timbre. It might be attacked with AI, but is there an easier way? For example, is instrument A going to be in a different frequency range than instrument B?
For the most part, instruments are going to have overlapping frequency content. I don't know the specific algorithms for isolating instruments, though I've heard they do exist. I would think that a big element is not just tracking all the harmonics and frequency content, but looking for correspondences in volume changes or frequency changes of the different frequencies, in order to determine which frequencies should be grouped together as a single instrument. Since instruments often play the same notes at the same time, this would be no mean feat. If you are a beginner with digital signal processing, can I recommend "The Scientist and Engineer's Guide to Digital Signal Processing" by Steven W. Smith? (Free download, good book on the fundamental knowledge needed to tackle such a project.)
Is it possible with FFT to find a drum solo, or a drum break, in an audio file? Is this something FFT is able to do and are there any resources online that could aid me with learning?
In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect a period of time) and yields the magnitude of signal within the bin and its phase offset. You can therefore determine that there is signal at that particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the period shorter at the expense of frequency resolution.
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but it is not great in real-life polyphonic content.
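The time/frequency trade-off in the first point can be put into numbers. This is just arithmetic on the window length and sample rate; the function name and the 44.1 kHz rate are chosen for illustration.

```python
def fft_resolution(window_size, sample_rate):
    """Return (time resolution in ms, bin spacing in Hz) for one FFT window."""
    time_res_ms = 1000.0 * window_size / sample_rate
    bin_hz = sample_rate / window_size
    return time_res_ms, bin_hz

for n in (256, 1024, 4096):
    t_ms, f_hz = fft_resolution(n, 44100)
    print(f"N={n:5d}: {t_ms:6.1f} ms per window, {f_hz:6.1f} Hz per bin")
```

Shortening the window by a factor of four improves the time resolution by four and coarsens the frequency resolution by the same factor.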
However, you might be able to find some inference from the different characteristics of the spectra of a drum solo vs instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection and is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval (MIR). Your problem sounds like one of identifying sections in audio files, and this might be described as partitioning.
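To give a feel for what onset detection means in practice, here is a deliberately crude sketch that flags frames whose energy jumps sharply relative to the previous frame. It is not any particular published method; the frame size, the ratio threshold, and the synthetic "drum hit" are all assumptions of mine, and real detectors (spectral flux and similar) are considerably more robust on polyphonic material.

```python
import math
import random

def energy_onsets(samples, sample_rate, frame_size=256, ratio=4.0, floor=1e-6):
    """Return onset times (seconds) where frame energy jumps by `ratio`.

    Crude time-domain detector: compares the mean energy of each
    non-overlapping frame with that of the previous frame.
    """
    onsets = []
    prev = floor
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energy = sum(x * x for x in frame) / frame_size
        if energy > floor and energy > ratio * prev:
            onsets.append(start / sample_rate)
        prev = max(energy, floor)
    return onsets

# Synthetic test signal: silence, then a decaying noise burst at sample 2048.
rng = random.Random(0)
signal = [0.0] * 2048 + [rng.uniform(-1.0, 1.0) * math.exp(-k / 300.0)
                         for k in range(2048)]
onsets = energy_onsets(signal, 8000)
print(onsets)  # a single onset at 2048 / 8000 = 0.256 s
```

Note that the onset time is only known to within one frame, which is the same resolution limit described for the FFT above.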
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.
I'd look here, there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)
I want to take two sounds that contain a dominant frequency and say 'this one is higher than this one'. I could do FFT, find the frequency with the greatest amplitude of each and compare them. I'm wondering if, as I have a specific task, there may be a simpler algorithm.
The sounds are quite dirty with many frequencies, but contain a clear dominant pitch. They aren't perfectly produced sine waves.
Given that the sounds are quite dirty, I would suggest starting to develop the algorithm with the output of an FFT as it'll be much simpler to diagnose any problems. Then when you're happy that it's working you can think about optimising/simplifying.
As a rule of thumb when developing this kind of numeric algorithm, I always try to operate first in the most relevant domain (in this case you're interested in frequencies, so analyse in frequency space) at the start, and once everything is behaving itself consider shortcuts/optimisations. That way you can test the latter solution against the best-performing former.
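As a starting point along those lines, here is a sketch of the "find the strongest bin and compare" approach. It uses a naive DFT from the standard library so it runs anywhere; in practice you would use a real FFT routine. The function names, the Hann window, and the synthetic "dirty" tones are my assumptions for illustration.

```python
import cmath
import math
import random

def dominant_frequency(samples, sample_rate):
    """Return the centre frequency (Hz) of the strongest DFT bin.

    Naive O(N^2) DFT with a Hann window; fine for a short frame,
    far too slow for long recordings.
    """
    n = len(samples)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    best_bin, best_mag = 1, 0.0
    for k in range(1, n // 2):  # skip DC, stop below Nyquist
        acc = sum(windowed[i] * cmath.exp(-2j * math.pi * k * i / n)
                  for i in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def dirty_tone(freq, sample_rate, n, rng):
    """Hypothetical test sound: fundamental, weaker overtone, and noise."""
    return [math.sin(2 * math.pi * freq * i / sample_rate)
            + 0.4 * math.sin(2 * math.pi * 2 * freq * i / sample_rate)
            + 0.2 * rng.uniform(-1.0, 1.0)
            for i in range(n)]

rng = random.Random(1)
fs, n = 8000, 512
f_a = dominant_frequency(dirty_tone(440.0, fs, n, rng), fs)
f_b = dominant_frequency(dirty_tone(660.0, fs, n, rng), fs)
print(f_a, f_b, "second is higher" if f_b > f_a else "first is higher")
```

The result is only accurate to one bin (here 8000 / 512, about 15.6 Hz), which is enough to rank two sounds that are clearly apart but not two that are nearly the same pitch.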
In the general case, decent pitch detection/estimation requires a more sophisticated algorithm than looking at FFT peaks, not a simpler one.
There are a variety of pitch detection methods ranging in sophistication from counting zero-crossing (which obviously won't work in your case) to extremely complex algorithms.
While the frequency domain methods seem most appropriate, it's not as simple as "taking the FFT". If your data is very noisy, you may have spurious peaks that are higher than what you would consider to be the dominant frequency. One solution is to window overlapping segments of your signal, compute STFTs, and average the results. But this raises more questions: how big should the windows be? In this case, it depends on how far apart you expect those dominant peaks to be, how long your recordings are, etc. (Note: FFT methods can resolve to better than one bin by taking phase information into account. In that case, you would have to do something more complex than averaging all your FFT windows together.)
Another approach would be a time-domain method, such as YIN:
http://recherche.ircam.fr/equipes/pcm/cheveign/pss/2002_JASA_YIN.pdf
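For orientation, the core of YIN can be sketched in a few lines: a squared-difference function over lags, its cumulative-mean normalisation, and an absolute threshold followed by a walk down to the local minimum. This is a simplified sketch, not the reference implementation; it omits the parabolic interpolation and best-local-estimate steps from the paper, and the parameter defaults are my assumptions.

```python
import math

def yin_pitch(samples, sample_rate, fmin=60.0, fmax=500.0, threshold=0.15):
    """Estimate pitch (Hz) of one frame with a simplified YIN; None if unvoiced."""
    tau_min = int(sample_rate / fmax)
    tau_max = int(sample_rate / fmin)
    window = len(samples) - tau_max
    if window <= 0:
        raise ValueError("frame too short for the requested fmin")

    # Step 1: squared difference between the frame and its lagged copy.
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((samples[j] - samples[j + tau]) ** 2
                        for j in range(window))

    # Step 2: cumulative mean normalised difference (value at lag 0 is 1).
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0

    # Step 3: first lag under the threshold, then descend to the local minimum.
    for tau in range(tau_min, tau_max):
        if cmnd[tau] < threshold:
            while tau + 1 <= tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            return sample_rate / tau
    return None

# Synthetic 200 Hz tone with a strong second harmonic, sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs)
         + 0.5 * math.sin(2 * math.pi * 400 * i / fs) for i in range(400)]
estimate = yin_pitch(frame, fs)
print(estimate)
```

Because the lag is an integer number of samples, the estimate is quantised; the interpolation step in the paper is what removes that limitation.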
Wikipedia discusses some more methods:
http://en.wikipedia.org/wiki/Pitch_detection_algorithm
You can also explore some more methods in chapter 9 of this book:
http://www.amazon.com/DAFX-Digital-Udo-ouml-lzer/dp/0471490784
You can get matlab sourcecode for yin from chapter 9 of that book here:
http://www2.hsu-hh.de/ant/dafx2002/DAFX_Book_Page_2nd_edition/matlab.html
Can anyone provide sample pseudocode or share some existing link that has sample code.
Like for example I have a mix audio of 1kHz or 2kHz or 8kHz or so, and I want to boost certain frequencies like 1kHz only in real-time.
Reading some DSP books and resources confuses me.
You just need to design and implement a suitable digital filter. This is a large and complex subject area though, so you won't get a simple answer here. Probably the best first step would be to read a good introductory book on DSP, e.g. Understanding DSP by Rick Lyons, which is very good for beginners as it's not too heavy on the math and has a more practical bent than most such introductory DSP books.
For this particular application though what you are trying to do is similar to implementing a graphic equalizer, and there are many pointers to how to implement this kind of thing if you use e.g. "graphic equalizer" as a search term.
There's a lot of math behind digital filtering. Sorry, I think it is important to at least understand basic filters (like those used in electronics). If you don't want to go through the basics: best to get an audio graphics equaliser where you can play with the (virtual) sliders. If you want to implement a very specific filter, please read on.
Real time: depends on your computing platform. If this is a small micro (like AVR, Microchip PIC, ...) you'll need an efficient algorithm. This is likely an IIR band pass filter. The equivalent of a graphics equaliser consists of multiple band pass filters, all summed together. See http://en.wikipedia.org/wiki/Infinite_impulse_response
A more computing intensive algorithm uses FIR filters. In that case you can also control the phase of the filtered signal. http://en.wikipedia.org/wiki/Finite_impulse_response
If you find an algorithm (e.g. IIR), you'll need to calculate the coefficients. The algorithm is simple; calculating the coefficients is not.
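As one concrete possibility, here is a sketch of a second-order IIR "peaking" filter that boosts a band around 1 kHz. The coefficient formulas are the widely used ones from Robert Bristow-Johnson's Audio EQ Cookbook, which is my choice rather than anything named above; the 48 kHz sample rate, 6 dB gain, and Q of 1 are assumptions for the example.

```python
import math

def peaking_eq_coefficients(sample_rate, center_hz, gain_db, q=1.0):
    """Return normalised biquad coefficients ((b0, b1, b2), (a1, a2))."""
    a = 10 ** (gain_db / 40.0)
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = 1 + alpha * a, -2 * math.cos(w0), 1 - alpha * a
    a0, a1, a2 = 1 + alpha / a, -2 * math.cos(w0), 1 - alpha / a
    return (b0 / a0, b1 / a0, b2 / a0), (a1 / a0, a2 / a0)

def biquad(samples, b, a):
    """Run the filter sample by sample, as you would in a real-time loop."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def steady_state_gain(freq, sample_rate, b, a, n=4800, tail=960):
    """Measure the filter's gain for a sine at `freq` via the RMS of the tail."""
    tone = [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]
    filtered = biquad(tone, b, a)
    rms = lambda xs: math.sqrt(sum(v * v for v in xs) / len(xs))
    return rms(filtered[-tail:]) / rms(tone[-tail:])

b, a = peaking_eq_coefficients(48000, 1000.0, 6.0)
gain_1k = steady_state_gain(1000.0, 48000, b, a)
gain_8k = steady_state_gain(8000.0, 48000, b, a)
print(round(gain_1k, 2), round(gain_8k, 2))
```

The per-sample loop needs only five multiplies and four additions, which is why this kind of filter suits small microcontrollers; the 1 kHz tone comes out at roughly twice its input amplitude (6 dB) while the 8 kHz tone is left almost untouched.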
I found a book matching your question: Audio digital signal processing in real time
I browsed through it; it seems to have the right answers.
I want to differentiate between the male and female voices in an audio file and separate them. As output I want the two voices separated. Can you please help me out, and can the coding be done in Java or C++?
This is potentially a very complicated question, and it is similar to writing your own speech recognition (or identification) algorithm.
You would start by converting the audio into the frequency domain, which is done using a Fast Fourier Transform.
For each slice in time that you take an FFT, this will give you a list of frequencies and their amplitudes. You will somehow need to detect the fundamental tone by analysing the harmonics. The 2nd and 3rd harmonics will be clearest. It's very hard to figure out which harmonics they are, especially with the background noise and the natural difference between people's voices in terms of which harmonics are loudest. Then you can try to determine if the speaker is male or female by whatever you guessed the fundamental tone to be.
Keep in mind that during many parts of speech like sibilance ('s', 't', etc) there is no tone, just noise. It will need to be pretty intelligent.
Hope that sets you in the right general direction.
Note: if the two voices are simultaneous and you want to separate them cleanly, then this won't help you. I don't believe anyone alive has solved such a problem.
I think this is already possible. I just started taking an on-line course on Machine Learning by Stanford University with professor Andrew Ng, and during the first lecture he shows a demo where an audio recording of two overlapping voices is processed and the individual voices extracted (the same with music in the background and a person speaking). Apparently it uses an unsupervised learning algorithm that allows it to extract the two underlying patterns. You may want to look into that course (there's one version of the course here: http://www.academicearth.org/courses/machine-learning)
One such tool that makes this possible is LIUM spkdiarization. Written in Java and available under the GPL, it is a speaker diarization toolkit and uses statistical models for male, female and child voices. Luckily for you, the models are provided and you can use it without having to tag the recordings and train the models.
See the scripting page of the LIUM wiki for examples; search the page for "gender".
I would start by saying this is impossible. Speech recognition is really, really hard.
You're not clear in your question - are the voices overlapping? If so, splitting them up will be absurdly difficult.
If they are separate, your best bet is to have a large set of samples of male and female voices, and look for common characteristics (and a way to programmatically identify them). If the samples aren't recorded cleanly (if they have background noise), things get even more complicated.
You may get away with an average tone: male voices are generally deeper than female ones.
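The "average tone" idea could look something like the sketch below. The cut-off values are assumptions based on commonly quoted adult speaking ranges (roughly 85-155 Hz for men and 165-255 Hz for women); they are not a reliable classifier, and the function name is made up for this example.

```python
def rough_voice_range(median_f0_hz, male_max_hz=155.0, female_min_hz=165.0):
    """Very rough guess from the median fundamental frequency of a segment.

    The thresholds are assumptions; children, singing, and unusual
    voices will be misclassified.
    """
    if median_f0_hz <= male_max_hz:
        return "likely male"
    if median_f0_hz >= female_min_hz:
        return "likely female"
    return "uncertain"

for f0 in (110.0, 160.0, 210.0):
    print(f0, rough_voice_range(f0))
```

This only labels a segment that contains one speaker; it does nothing to separate voices that overlap in time.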
What you are asking is one hell of a task. thomasrutter wrote some pointers on how to do it, but I guess the algorithm would have to be really, really robust if you wished to use it everywhere (in all sorts of music, with singing of course). Maybe it would be better/easier to start with separating (splitting) a single instrument sample from the song.