I don't want sound-to-text software. What I need is the following:
I'll record multiple (say 50+) audio streams (recordings of radio stations)
from those recordings, I'll mark interesting audio clips - their length ranges from 2 to 60 seconds - there will be a few thousand such clips
the library should be able to find other instances of the same audio clips in the recorded sound streams
a confidence factor should be reported to the user, and additional input can be provided so that recognition performs better next time
Do you know of such software library? LGPL would be most valuable to me, but I can go for commercial license as well.
Audio clips will contain music, speech, effects, or any combination thereof. So, TEXT recognition is out of the question.
Architecture: C++, with C# for glue, and CUDA if possible.
I have not found any libraries (yet), but two interesting papers, which may give you terminology and background to refine your searches:
Audio Fingerprinting for Broadcast Streams
Audio Segment Retrieval using HMM
EDIT: Searching for "audio fingerprinting" led me to a page of implementations, both open source and commercial.
http://wiki.musicbrainz.org/AudioFingerprint
Picard seems to be well established, and could be useful if your clips contain music.
Here is an introduction to Audio fingerprinting
What you are describing is a matched filter, and all you need is a cross-correlation function, which should be part of any reasonable DSP library. Depending on your choice of processor architecture and language, you may even be able to find a vectorized library that performs this operation more efficiently.
If you don't really care about performance you could use Python...
$ python
Python 2.6.4 (r264:75706, Dec 7 2009, 18:45:15)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy
>>> interesting_clip = [ 5, 7, 2, 1]
>>> full_stream = [ 1, 5, 7, 2, 1, 4, 3, 2, 4, 7, 1, 2, 2, 5, 1]
>>> correlation = scipy.correlate (full_stream, interesting_clip)
>>> print correlation
[56 79 55 28 41 49 44 53 73 48 28 35]
>>> for offset, value in enumerate(correlation) :
...     if (value > 60) :
...         print "match at position", offset, "with value of", value
...
match at position 1 with value of 79
match at position 8 with value of 73
My threshold above is arbitrary. You should experimentally determine what is appropriate for you.
Keep in mind that the longer your "interesting clip", the longer it will take to compute the correlation. While longer clips will help actual matches stand out better from non-matches, you probably won't need more than a few seconds.
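If you want the threshold to be less arbitrary, you can normalize the correlation so scores fall between -1 and 1. Here is a minimal sketch in Python 3 with NumPy; the function name and the 0.9 threshold are my own illustrative choices, not part of any library:

import numpy as np

# Normalized cross-correlation: each window of the stream is compared to
# the clip with a Pearson-style score in [-1, 1], so a fixed threshold
# like 0.9 is meaningful regardless of signal amplitude.
def find_matches(stream, clip, threshold=0.9):
    clip = (clip - clip.mean()) / (clip.std() * len(clip))
    matches = []
    for offset in range(len(stream) - len(clip) + 1):
        window = stream[offset:offset + len(clip)]
        if window.std() == 0:
            continue  # skip constant (silent) windows
        score = np.dot(clip, (window - window.mean()) / window.std())
        if score > threshold:
            matches.append((offset, score))
    return matches

stream = np.array([1, 5, 7, 2, 1, 4, 3, 2, 4, 7, 1, 2, 2, 5, 1], dtype=float)
clip = np.array([5, 7, 2, 1], dtype=float)
print(find_matches(stream, clip))  # exact match at offset 1 scores 1.0; offset 8 also clears 0.9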
AudioDB is an open-source C++ project that searches for similar sections of audio, handles noisy streams, and can give you a measure of similarity. It can be run as client/server, but I believe you can build a standalone program as well.
The other answers about DSP correlation are broadly correct, but in general those algorithms want to compare two streams of the same length, with the similar parts overlapping.
What you need has to work on arbitrary segments of the stream; this is what AudioDB was built for. (One application is finding hidden references/sampling or blatant copyright misuse.) I've used it to find sounds that were played backwards, and it also handles cases where some noise or speech changes are introduced.
Note that it is still under development, even though the dates on the home page seem out of date. I would subscribe to the mailing list and ask about the current state and how you might go about incorporating it.
You might want to look at this paper by Li-Chun Wang regarding www.shazam.com.
It is not an API, but it does give details of how their algorithm was developed.
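To get a feel for the approach, here is a toy Python sketch of the landmark idea from that paper: pick spectrogram peaks and hash pairs of nearby peaks as (f1, f2, time-delta). This is my own loose illustration, not the paper's actual implementation:

import numpy as np
from scipy import signal

# Toy landmark hashing: take the loudest bin per frame as a "peak", then
# pair each peak with the next few peaks and hash (f1, f2, dt). Matching
# a clip against a stream then reduces to counting identical hashes whose
# anchor times line up.
def landmarks(samples, rate, fan_out=5):
    f, t, sxx = signal.spectrogram(samples, rate, nperseg=1024)
    peaks = [(ti, int(np.argmax(sxx[:, ti]))) for ti in range(sxx.shape[1])]
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.append(((f1, f2, t2 - t1), t1))  # (hash, anchor frame)
    return hashes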
Take a look at the Microsoft Speech API (SAPI):
http://msdn.microsoft.com/en-us/library/ee125077%28VS.85%29.aspx
All the other requirements you listed are basically implementation details that you'll have to build yourself. For example, as the software interprets the audio streams, it can store the transcripts in SQL Server with full-text indexing; from those you run searches to find similar/same audio clips.
There are of course other ways to implement that, and this is but one idea :-)
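As a rough illustration of that transcribe-then-index idea, here is a Python sketch using the SpeechRecognition package and SQLite's full-text index as stand-ins for SAPI and SQL Server (both stand-ins, the file names, and the search term are my assumptions):

import sqlite3
import speech_recognition as sr

# Transcribe a clip, store the transcript in a full-text index, then
# search the index to find clips with similar spoken content.
recognizer = sr.Recognizer()
with sr.AudioFile("clip.wav") as source:
    text = recognizer.recognize_google(recognizer.record(source))

db = sqlite3.connect("clips.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS clips USING fts5(name, transcript)")
db.execute("INSERT INTO clips VALUES (?, ?)", ("clip.wav", text))
for (name,) in db.execute("SELECT name FROM clips WHERE clips MATCH ?", ("weather",)):
    print("candidate clip:", name)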
I would go somewhere in line with Tim Kryger's answer and use simple statistical correlation functions, as you want to stay content-agnostic.
As for the features, I would definitely try MFCCs, as they are used both in speech processing and in music recognition (genres, songs). You can find MFCCs and a wealth of other audio features in the excellent open-source Vamp plugins (or their more high-level bundle, a program called Sonic Annotator), or alternatively in the Marsyas framework.
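If you want to try MFCCs quickly before committing to Vamp or Marsyas, here is a minimal sketch in Python using the librosa library (librosa and the file name are my choices, not something this answer prescribes):

import librosa

# Load a clip and compute 13 MFCCs per analysis frame; the resulting
# matrix (13 x n_frames) is the feature sequence you would compare
# between clips, e.g. with DTW or simple correlation.
y, sr = librosa.load("clip.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)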
Related
I am using the command line tool aubiopitch to analyze voice recordings. My goal is to determine the fundamental frequency of the voice recorded. I know, of course, that the frequency varies – that's why I want to calculate an "average" in Hz over a 30-second recording.
My question: aubio uses different methods to determine the pitch of a recording: Schmitt trigger, harmonic comb, yin, yinfft, etc. Which of those would be the preferred choice when dealing with pure human voice recordings (no background music, ambient noise, etc.)?
I would recommend using yinfast or yinfft (default). For a discussion of the algorithms, their parameters, and their performance, see Chapter 3 of this document.
Note that the median is better suited than the average in this case.
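For example, a small Python sketch that reads aubiopitch's two-column time/frequency output and reports the median over voiced frames (the redirection to a text file and the script name are my assumed workflow):

# Usage (assumed): aubiopitch recording.wav > pitch.txt
#                  python median_f0.py pitch.txt
import statistics
import sys

freqs = []
with open(sys.argv[1]) as fh:
    for line in fh:
        parts = line.split()
        if len(parts) >= 2 and float(parts[1]) > 0:  # skip unvoiced frames
            freqs.append(float(parts[1]))

print("median F0: %.1f Hz" % statistics.median(freqs))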
CREPE is good and outperforms many other methods, since it uses a deep neural network for pitch prediction. It might be unstable in unseen conditions, though, and might not be very easy to plug in, since it requires TensorFlow.
For a more traditional and lightweight solution, you can try REAPER.
I have many audio files with clean audio and only spoken voice in Mandarin Chinese. I need an estimate of how many syllables are spoken in each file. Is there a tool for OS X, Windows, or Linux that can produce such estimates?
sample01.wav 15
sample02.wav 8
sample03.wav 5
sample04.wav 1
sample05.wav 18
As there are many files, command-line or batch-capable software is preferred, e.g.:
$ application sample01.wav
15
A solution that uses speech-to-text and then counts the number of characters present would be suitable too.
The automatic segmentation of speech is an active scientific domain, meaning that there is no method that works perfectly.
In 2009, de Jong and Wempe proposed a method to automatically detect syllables in a human speech signal using Praat. This method compares well with manual segmentation and has been employed in many third-party scientific studies. You can find a detailed description of the method in their scientific article (pdf), along with a historical perspective on previously proposed methods. The Praat script itself and a couple of tutorials can be found on a dedicated website (www - speechrate).
You may also be interested in another segmentation algorithm developed by Harma that has been implemented in Matlab (Harma Syllable Segmentation)
You can use formants to determine this. Each syllable should correspond to a formant. Here is more information on formants:
https://en.wikipedia.org/wiki/Formants
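Formants are usually estimated with linear predictive coding (LPC). Here is a rough Python sketch using librosa's LPC on a short frame (the file name, frame size, and LPC order are illustrative assumptions):

import numpy as np
import librosa

# Fit an LPC model to one short frame, then read formant frequencies off
# the angles of the complex roots of the prediction polynomial.
y, sr = librosa.load("speech.wav", sr=None)
a = librosa.lpc(y[:2048], order=8)
roots = [r for r in np.roots(a) if np.imag(r) > 0]
formants = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
print(["%.0f Hz" % f for f in formants[:3]])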
This might be of interest for you
http://sites.google.com/site/speechrate/
Your question requires specific attention and a solution tailored to speech-to-text.
I really doubt that any free, open-source library which is easily available will serve the purpose.
I have used one, but for the reverse purpose: text-to-speech.
Though this is not a free library, I would suggest you just Google "annosoft lipsync"...
http://www.annosoft.com/lipsync-sdks
This library is available for SDK evaluation as well.
Is it possible with an FFT to find a drum solo, or a drum break, in an audio file? Is this something an FFT is able to do, and are there any resources online that could help me learn?
In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect, a period of time) and yields the magnitude and phase offset of the signal within each frequency bin. You can therefore determine that there is signal in a particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the window shorter at the expense of frequency resolution.
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but it is not great for real-life polyphonic content.
However, you might be able to draw some inference from the different spectral characteristics of a drum solo versus the instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection and is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval. Your problem sounds like one of identifying sections in audio files, which might be described as partitioning.
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.
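If you want to experiment in code before reaching for Sonic Visualiser, here is a minimal onset-detection sketch in Python using librosa (my choice of library; the file name is a placeholder):

import librosa

# Compute a spectral-flux style onset strength envelope, then pick its
# peaks as onset times in seconds. Dense clusters of strong onsets are a
# crude cue for percussive sections such as a drum break.
y, sr = librosa.load("track.wav", sr=None)
env = librosa.onset.onset_strength(y=y, sr=sr)
onsets = librosa.onset.onset_detect(onset_envelope=env, sr=sr, units="time")
print("%d onsets; first few at %s" % (len(onsets), onsets[:5]))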
I'd look here; there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)
I am looking for a toolkit or library to search the contents of audio files for an audio sample.
For example, I have 5 seconds of speech that I know exists somewhere in hundreds of hours of audio, and I want to find the exact file and position of those sub-samples.
The sample is 99% similar but may have been converted to a different audio format, so it may have minor differences in its waveform.
I prefer .NET library if there is such an option.
Thank you.
What you are trying to do is not an easy DSP problem to solve, and there is no single foolproof method. There is, however, an excellent recent article on audio fingerprinting on CodeProject which goes into some depth on an algorithm that searches for duplicate MP3s, with code in C#. You may be able to adapt the algorithm to your needs.
I want to differentiate between the male and female voices in an audio file and separate them. As output, I want the two voices separated. Can you please help me out, and can the coding be done in Java or C++?
This is potentially a very complicated question, and it is similar to writing your own speech recognition (or identification) algorithm.
You would start by converting the audio into the frequency domain, which is done using a Fast Fourier Transform.
For each time slice you take an FFT of, this will give you a list of frequencies and their amplitudes. You will somehow need to detect the fundamental tone by analysing the harmonics; the 2nd and 3rd harmonics will be the clearest. It's very hard to figure out which harmonics are which, especially with background noise and the natural differences between people's voices in terms of which harmonics are loudest. You can then try to determine whether the speaker is male or female from whatever you guessed the fundamental tone to be.
Keep in mind that during many parts of speech like sibilance ('s', 't', etc) there is no tone, just noise. It will need to be pretty intelligent.
Hope that sets you in the right general direction.
Note: if the two voices are simultaneous and you want to separate them cleanly, then this won't help you. I don't believe anyone alive has solved such a problem.
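To make that general direction concrete, here is a rough Python sketch of the pitch-threshold idea for clean, non-overlapping speech. The 165 Hz male/female boundary and the file name are illustrative assumptions, not a reliable classifier:

import numpy as np
from scipy.io import wavfile

# Estimate the fundamental of one frame by finding the strongest spectral
# peak in the plausible range of speaking fundamentals (50-400 Hz).
rate, samples = wavfile.read("speaker.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # mix down to mono

n = 4096
frame = samples[:n].astype(float) * np.hanning(n)
spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(n, d=1.0 / rate)

band = (freqs > 50) & (freqs < 400)
f0 = freqs[band][np.argmax(spectrum[band])]
print("estimated F0: %.1f Hz -> %s" % (f0, "male?" if f0 < 165 else "female?"))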
I think this is already possible. I just started taking an online course on Machine Learning from Stanford University with Professor Andrew Ng, and during the first lecture he shows a demo where an audio recording of two overlapping voices is processed and the individual voices are extracted (likewise with music in the background and a person speaking). Apparently it uses an unsupervised learning algorithm that allows it to extract the two underlying patterns. You may want to look into that course (there's one version of the course here: http://www.academicearth.org/courses/machine-learning)
One such tool that makes this possible is LIUM SpkDiarization. Written in Java and available under the GPL, it is a speaker diarization tool that uses statistical models for male, female, and child voices. Luckily for you, the models are provided, so you can use it without having to tag recordings and train the models yourself.
See the scripting page of the LIUM wiki for examples; search the page for "gender".
I would start by saying this is impossible. Speech recognition is really, really hard.
You're not clear in your question: are the voices overlapping? If so, splitting them apart will be absurdly difficult.
If they are separate, your more likely bet is to have a large set of samples of male and female voices and look for common characteristics (and a way to identify them programmatically). If the samples aren't recorded cleanly (if they have background noise), things get even more complicated.
You may get away with using average pitch: male voices are generally deeper than female voices.
What you are asking is one hell of a task. thomasrutter wrote some pointers on how to do it, but I guess the algorithm would have to be really robust if you wanted to use it everywhere (on all sorts of music, with singing of course). Maybe it would be better/easier to start with separating (splitting) a single instrument's sample from a song.