Speaker Recognition [closed] - audio

How could I differentiate between two people speaking? As in, if someone says "hello" and then another person says "hello", what kind of signature should I be looking for in the audio data? Periodicity?
Thanks a lot to anyone who can answer this!

The solution to this problem lies in digital signal processing (DSP). Speaker recognition is a complex problem that brings computer science and communications engineering together. Most speaker-identification techniques combine signal processing with machine learning (training over a speaker database, then identification against the training data). An outline of the algorithm that may be followed:
Record the audio in raw format. This serves as the digital signal which needs to be processed.
Apply some pre-processing routines to the captured signal. These could be simple signal normalization, or filtering to remove noise (using a band-pass filter for the normal frequency range of the human voice; a band-pass filter can in turn be created by combining a low-pass and a high-pass filter).
Once it is fairly certain that the captured signal is largely free of noise, the feature-extraction phase begins. Some well-known techniques for extracting voice features are Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) or simple FFT features.
Now, there are two phases - training and testing.
First, the system needs to be trained on the voice features of different speakers before it is capable of distinguishing between them. To ensure that the features are representative, it is recommended that several (>10) voice samples be collected from each speaker for training purposes.
Training can be done using different techniques, such as neural networks or distance-based classification, to find the differences between the voice features of different speakers.
In the testing phase, the training data is used to find the voice feature set that lies at the smallest distance from the signal being tested. Different distance measures, such as Euclidean or Chebyshev distance, may be used to calculate this proximity. A minimal sketch of such a pipeline is given below.
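For concreteness, here is a minimal sketch of the pipeline above, assuming librosa and scipy are available; the file names, the 80-4000 Hz band, and the choice of averaged MFCCs with a nearest-distance rule are illustrative assumptions, not the only way to do it.

    # Minimal speaker-identification sketch (hypothetical file names).
    import numpy as np
    import librosa
    from scipy.signal import butter, lfilter

    def bandpass(signal, sr, low=80.0, high=4000.0, order=4):
        # Step 2: band-pass filter covering the typical human-voice range,
        # built from a combined low-pass/high-pass (Butterworth) design.
        b, a = butter(order, [low, high], btype="band", fs=sr)
        return lfilter(b, a, signal)

    def voice_features(path):
        # Steps 1-3: load, normalize, filter, extract MFCCs, then average
        # over frames to get one fixed-length feature vector per recording.
        y, sr = librosa.load(path, sr=16000)
        y = y / (np.max(np.abs(y)) + 1e-9)
        y = bandpass(y, sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return mfcc.mean(axis=1)

    # Training phase: one averaged feature vector per speaker
    # (more than 10 samples each is recommended, as noted above).
    train = {
        "alice": np.mean([voice_features(f"alice_{i}.wav") for i in range(10)], axis=0),
        "bob":   np.mean([voice_features(f"bob_{i}.wav") for i in range(10)], axis=0),
    }

    # Testing phase: pick the speaker whose stored features lie at the
    # smallest Euclidean distance from the test signal's features.
    test = voice_features("unknown.wav")
    print("Closest speaker:", min(train, key=lambda s: np.linalg.norm(train[s] - test)))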
There are two open source implementations which enable speaker identification - ALIZE: http://mistral.univ-avignon.fr/index_en.html and MARF: http://marf.sourceforge.net/.
I know it's a bit late to answer this question, but I hope someone finds it useful.

This is an extremely hard problem, even for experts in speech and signal processing. This page has much more information: http://en.wikipedia.org/wiki/Speaker_recognition
And some suggested technology starting points:
The various technologies used to process and store voice prints include frequency estimation, hidden Markov models, Gaussian mixture models, pattern matching algorithms, neural networks, matrix representation, vector quantization and decision trees. Some systems also use "anti-speaker" techniques, such as cohort models and world models.

Having only two people to differentiate, especially if they are uttering the same word or phrase, will make this much easier. I suggest starting with something simple and only adding complexity as needed.
To begin, I'd try sample counts of the digital waveform, binned by time and magnitude, or (if you have the software functionality handy) an FFT of the entire utterance. I'd also consider a basic modeling process first, such as a linear discriminant (or whatever you already have available); a rough sketch of that idea follows.
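As a hedged sketch of that suggestion, assuming scikit-learn and soundfile are available (the file lists and labels are hypothetical): coarse FFT-magnitude bins of each whole utterance fed to a linear discriminant.

    # Coarse FFT-bin features plus a linear discriminant (hypothetical files).
    import numpy as np
    import soundfile as sf
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fft_bins(path, n_bins=32):
        y, sr = sf.read(path)
        if y.ndim > 1:
            y = y.mean(axis=1)                      # mix down to mono
        mag = np.abs(np.fft.rfft(y))
        # Average the magnitude spectrum into a handful of coarse bins.
        return np.array([chunk.mean() for chunk in np.array_split(mag, n_bins)])

    X = [fft_bins(f) for f in ["speaker_a_1.wav", "speaker_a_2.wav",
                               "speaker_b_1.wav", "speaker_b_2.wav"]]
    y = [0, 0, 1, 1]                                # speaker labels
    clf = LinearDiscriminantAnalysis().fit(X, y)
    print(clf.predict([fft_bins("test.wav")]))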

Another way to go is to use an array of microphones and differentiate between the positions and directions of the vocal sources. I consider this an easier approach, since the position calculation is much less complicated than separating different speakers from a mono or stereo source.

Related

How do face recognition systems differentiate between a real face and a photo of a face?

I'm working on a face recognition project with Python and OpenCV. I detect faces, but I have a problem: I don't know how to make the system differentiate between real and fake faces from a 2D image.
If someone has any ideas, please help me.
Thank you.
There is a really good article (code included) by Adrian from pyimagesearch tackling this exact problem with a liveness detector.
Below is an extract from that article:
There are a number of approaches to liveness detection, including:
Texture analysis, including computing Local Binary Patterns (LBPs) over face regions and using an SVM to classify the faces as real or spoofed.
Frequency analysis, such as examining the Fourier domain of the face.
Variable focusing analysis, such as examining the variation of pixel values between two consecutive frames.
Heuristic-based algorithms, including eye movement, lip movement, and blink detection. This set of algorithms attempts to track eye movement and blinks to ensure the user is not holding up a photo of another person (since a photo will not blink or move its lips).
Optical Flow algorithms, namely examining the differences and properties of optical flow generated from 3D objects and 2D planes.
3D face shape, similar to what is used on Apple’s iPhone face recognition system, enabling the face recognition system to distinguish between real faces and printouts/photos/images of another person.
Combinations of the above, enabling a face recognition system engineer to pick and choose the liveness detections models appropriate for their particular application.
You can solve this problem using multiple methods. I'm listing some of them here; you can find a few more by referring to research papers.
Motion Approach: You can make the user blink or move, which helps confirm that they are real (most likely to work on video datasets or sequential images).
Feature Approach: Extract useful features from an image and use them to make a binary classification decision: real or not.
Frequency Analysis: Examining the Fourier domain of the face.
Optical Flow algorithms: Namely examining the differences and properties of optical flow generated from 3D objects and 2D planes.
Texture Analysis: You can also compute Local Binary Patterns using OpenCV to classify images as fake or not; refer to this link for details on this approach. A rough sketch of this idea is shown after this list.
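As a hedged sketch of the texture-analysis idea (LBP histograms of face crops fed to an SVM), assuming scikit-image, scikit-learn and OpenCV are available; the image files and labels are hypothetical.

    # LBP-histogram texture features plus an SVM (hypothetical image files).
    import numpy as np
    import cv2
    from skimage.feature import local_binary_pattern
    from sklearn.svm import SVC

    def lbp_histogram(path, P=8, R=1):
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        lbp = local_binary_pattern(gray, P, R, method="uniform")
        # The normalized histogram of LBP codes is the texture descriptor.
        hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
        return hist

    X = [lbp_histogram(f) for f in ["real_1.png", "real_2.png",
                                    "fake_1.png", "fake_2.png"]]
    y = [1, 1, 0, 0]                                # 1 = real face, 0 = spoof
    clf = SVC(kernel="linear").fit(X, y)
    print(clf.predict([lbp_histogram("query.png")]))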

Extraction of sound features for goodness of pronunciation evaluation

I'm working on a concept for a mobile application for children's speech-therapy (logopaedic) exercises, i.e. goodness-of-pronunciation evaluation. In the first iteration we want to implement evaluation of the correct pronunciation of one isolated consonant (the Russian equivalent of the English “sh” [ʃ] sound). The result could be “correct” or “incorrect” (or better, a score, e.g. from 1 to 5).
We have ~50 samples recorded by speech therapists and marked on a 5-point quality scale. Each sample contains a separate sound (0.5-2 seconds). We can get more samples in the future.
In general, I split this problem into the following steps:
Preprocess sound signal (reduce noise, amplify/attenuate, remove silent periods);
Extract proper signal features that are correlated with consonant pronunciation quality. Features are a vector of numbers produced from a sound chunk (frame). Feature candidates: the frequency spectrum of the sound, MFCC coefficients, the amplitude spectrum, ... Another question is the feature frame size (time duration).
Use some classification algorithm ("machine learning" in general) to classify based on features from the sound training set.
The main problem I am stuck with is the lack of a methodology for extracting features.
I have tried the MFCC approach, but it seems that the feature vector depends more on the sound-intensity variation during the sample (frankly, I drew that conclusion just by looking at plots of the MFCC coefficients, like https://drive.google.com/file/d/0BzBavyZHrcMlS0xLQ2phbmxoRVk/view?usp=sharing, where the X values are the 13 MFCC coefficients and each line represents one 25 ms sound frame).
I am not sure about pure spectrum characteristics, because of the noisy nature of consonants.
A lot of papers and blog posts describe the problem of speech recognition in a word and utterance context. My intuition says that I need a different approach for my problem.
Examples of good features for similar tasks, and a general methodology for evaluating features, would both be useful to me. Thanks.
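For concreteness, a minimal sketch of what steps 2-3 above could look like, assuming librosa and scikit-learn; the file names, scores, the 25 ms frame, and the mean/std summary (one way to reduce sensitivity to intensity variation over time) are all illustrative assumptions.

    # Per-sample MFCC summary features for short consonant recordings
    # (hypothetical file names and scores).
    import numpy as np
    import librosa
    from sklearn.svm import SVC

    def consonant_features(path, sr=16000):
        y, _ = librosa.load(path, sr=sr)
        y, _ = librosa.effects.trim(y, top_db=30)     # drop leading/trailing silence
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        # Summarizing frames by mean and std gives one fixed-length vector
        # per sample and damps frame-to-frame intensity variation.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Therapist-rated samples, here collapsed to correct (>=4) vs. incorrect.
    paths, scores = ["s01.wav", "s02.wav", "s03.wav"], [5, 2, 4]
    X = [consonant_features(p) for p in paths]
    y = [int(s >= 4) for s in scores]
    clf = SVC().fit(X, y)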

How can audio data be abstracted for comparison purposes?

I am working on a project involving machine learning and data comparison.
For the purpose of this project, I am feeding abstracted video data to a neural network.
Now, abstracting image data is quite simple. I can take still-frames at certain points in the video, scale them down into 5 by 5 pixels (or any other manageable resolution) and get the pixel values for analysis.
The resulting data gives a unique, small and somewhat data-rich sample (even 5 samples of 5x5 px are enough to distinguish a drama from a nature documentary, etc).
However, I am stuck on the audio part. Since audio consists of samples and each sample by itself has no inherent meaning, I can't find a way to abstract audio down into processable blocks.
Are there common techniques for this process? If not, by what metrics can audio data be quantified and abstracted?
The process you require is audio feature extraction. A large number of feature detection algorithms exist, usually specialising in signals that are music or speech.
For music, chromacity, rhythm, and harmonic distribution are all features you might extract, along with many more.
Typically, audio feature extraction algorithms work at a fairly macro level - that is to say thousands of samples at a time.
A good place to get started is Sonic Visualiser, which is a plug-in host for audio visualisation algorithms, many of which are feature extractors.
YAAFE may also have some useful stuff in it.
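As a hedged sketch of what such macro-level feature extraction could look like in code, assuming librosa (the particular features and the mean/std summary are just illustrative choices):

    # Turn a stretch of audio into a small, fixed-size feature vector,
    # analogous to the 5x5 still-frame abstraction for video.
    import numpy as np
    import librosa

    def audio_fingerprint(path, sr=22050):
        y, _ = librosa.load(path, sr=sr, mono=True)
        feats = [
            librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),      # timbre
            librosa.feature.chroma_stft(y=y, sr=sr),          # harmonic content
            librosa.feature.spectral_centroid(y=y, sr=sr),    # brightness
            librosa.feature.zero_crossing_rate(y),            # noisiness
        ]
        # Summarize each feature track by its mean and std over time,
        # giving a short vector a network can consume.
        return np.concatenate([np.r_[f.mean(axis=1), f.std(axis=1)] for f in feats])

    print(audio_fingerprint("clip.wav").shape)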

Audio signal source separation with neural network

What I am trying to do is separate the audio sources and extract their pitch from the raw signal.
I modeled this process myself, as represented below:
Each source oscillates in its normal modes, which often makes its component peaks' frequencies integer multiples of a fundamental; these are known as harmonics. The sources are then resonated and finally combined linearly.
As seen above, I have many hints in the frequency-response pattern of the audio signals, but almost no idea how to 'separate' them. I've tried countless models of my own. This is one of them:
FFT the PCM
Get peak frequency bins and amplitudes.
Calculate pitch candidate frequency bins.
For each pitch candidate, analyze all the peaks using a recurrent neural network and find an appropriate combination of peaks.
Separate analyzed pitch candidates.
Unfortunately, none of them has successfully separated the signal so far.
I would appreciate any advice on solving this kind of problem, especially on modeling source separation like the one above.
Because no one has really attempted to answer this, and because you've marked it with the neural-network tag, I'm going to address the suitability of a neural network to this kind of problem. As the question was somewhat non-technical, this answer will also be "high level".
Neural networks require some sort of sample set from which to learn. In order to "teach" a neural net to solve this problem, you would essentially need to have a working set of known solutions to work from. Do you have this? If so, read on. If not, a neural net is probably not what you are seeking. You stated that you have "many hints" but no real solution. This leads me to believe you probably don't have sample sets. If you can get them, great; otherwise you might be out of luck.
Supposing now that you have a sample set of Raw Signal samples and corresponding Source 1 and Source 2 outputs... Well, now you're going to need a method for deciding on a topology. Assuming you don't know a lot about how neural nets work (and don't want to), and assuming you also don't know the exact degree of complexity of the problem, I would probably recommend the open source NEAT package to get you started. I am not affiliated in any way with this project, but I have used it, and it allows you to (relatively) intelligently evolve neural network topologies to fit the problem.
Now, in terms of how a neural net would solve this specific problem. The first thing that comes to mind is that all audio signals are essentially time-series. That is to say, the information they convey is somehow dependent and related to the data at previous timesteps (e.g. the detection of some waveform cannot be done from a single time-point; it requires information about previous timesteps as well). Again, there's a million ways of solving this problem, but since I'm already recommending NEAT I'd probably suggest you take a look at the C++ NEAT Time Series mod.
If you're going down this route, you'll probably be wanting to use some sort of sliding window to provide information about the recent past at each time step. For a quick and dirty intro to sliding windows, check out this SO question:
Time Series Prediction via Neural Networks
The size of the sliding window can be important, especially if you're not using recurrent neural nets. Recurrent networks allow neural nets to remember previous time steps (at the cost of performance - NEAT is already recurrent so that choice is made for you here). You will probably want the sliding window length (ie. the number of timesteps in the past provided at every time step) to be roughly equal to your conservative guess of the largest number of previous timesteps required to gain enough information to split your waveform.
I'd say this is probably enough information to get you started.
When it comes to deciding how to provide the neural net with the data, you'll first want to normalise the input signals (consider a sigmoid function) and experiment with different transfer functions (sigmoid would probably be a good starting point).
I would imagine you'll want to have 2 output neurons, providing normalised amplitude (which you would denormalise via the inverse of the sigmoid function) as the outputs representing Source 1 and Source 2 respectively. The fitness value (the way you judge the ability of each tested network to solve the problem) would be something along the lines of the negative of the RMS error of the output signal against the actual known signal (i.e. tested against the samples I referred to earlier that you will need to procure). A rough sketch of this data layout is below.
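As a hedged sketch of that data layout only (sliding-window inputs over the normalised mixture, two targets per time step), using scikit-learn's MLP as a stand-in for NEAT and synthetic sine waves as stand-ins for the sample set you would need to procure:

    # Sliding-window inputs, two source amplitudes as targets (synthetic data).
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def sliding_windows(signal, window):
        # Each row holds the current sample plus (window - 1) past samples.
        return np.stack([signal[i - window:i] for i in range(window, len(signal))])

    t = np.linspace(0, 1, 8000)
    source1 = np.sin(2 * np.pi * 220 * t)            # stand-ins for the known
    source2 = np.sin(2 * np.pi * 330 * t)            # source samples
    mixture = source1 + source2

    window = 64
    norm = lambda x: x / (np.max(np.abs(x)) + 1e-9)  # keep values in [-1, 1]

    X = sliding_windows(norm(mixture), window)
    Y = np.column_stack([norm(source1)[window:], norm(source2)[window:]])

    # Logistic (sigmoid) hidden units; training minimises squared error,
    # which plays the role of the (negative) RMS-error fitness above.
    net = MLPRegressor(hidden_layer_sizes=(64,), activation="logistic", max_iter=500)
    net.fit(X, Y)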
Suffice to say, this will not be a trivial operation, but it could work if you have enough samples to train the network against. What is a good number of samples? Well, as a rule of thumb, it's roughly a number large enough that a simple polynomial function of order N (where N is the number of neurons in the neural network you require to solve the problem) cannot fit all of the samples accurately. This is basically to ensure you are not simply overfitting the problem, which is a serious issue with neural networks.
I hope this has been helpful! Best of luck.
Additional note: your work to date wouldn't have been in vain if you go down this route. A neural network is likely to benefit from additional "help" in the form of FFTs and other signal-modelling "inputs", so you might want to consider taking the signal processing you have already done, organising it into an analog, continuous representation, and feeding it as an input alongside the input signal.

Real time pitch detection

I'm trying to do real-time pitch detection of a user's singing, but I'm running into a lot of problems. I've tried lots of methods, including FFT (FFT Problem (Returns random results)) and autocorrelation (Autocorrelation pitch detection returns random results with mic input), but I can't seem to get any method to give a good result. Can anyone suggest a method for real-time pitch tracking or how to improve on a method I already have? I can't seem to find any good C / C++ methods for real-time pitch detection.
Thanks,
Niall.
Edit: Just to note, I've checked that the mic input data is correct, and that when using a sine wave the results are more or less the correct pitch.
Edit: Sorry this is late, but at the moment I'm visualizing the autocorrelation by taking the values out of the results array and plotting each index on the X axis and its value on the Y axis (both divided by 100000 or so; I'm using OpenGL). Plugging the data into a VST host and using VST plugins isn't an option for me. At the moment it just looks like some random dots. Am I doing it correctly, or can you point me towards some code for doing it, or help me understand how to visualize the raw audio data and the autocorrelation data?
Taking a step back... To get this working you MUST figure out a way to plot intermediate steps of this process. What you're trying to do is not particularly hard, but it is error prone and fiddly. Clipping, windowing, bad wiring, aliasing, DC offsets, reading the wrong channels, the weird FFT frequency axis, impedance mismatches, frame size errors... who knows. But if you can plot the raw data, and then plot the FFT, all will become clear.
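As a hedged illustration of that advice (in Python with matplotlib rather than the C++ the question asks for, and with a synthetic 440 Hz buffer standing in for a mic frame), plotting the raw buffer and its FFT magnitude side by side makes most of those problems visible at a glance:

    # Plot one audio frame and its FFT magnitude (synthetic 440 Hz stand-in).
    import numpy as np
    import matplotlib.pyplot as plt

    sr = 44100
    t = np.arange(2048) / sr
    frame = np.sin(2 * np.pi * 440 * t)              # stand-in for a mic buffer

    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))

    fig, (ax1, ax2) = plt.subplots(2, 1)
    ax1.plot(t, frame)
    ax1.set_xlabel("time (s)")
    ax2.plot(freqs, mag)                             # the peak should sit at 440 Hz
    ax2.set_xlabel("frequency (Hz)")
    plt.show()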
I found several open source implementations of real-time pitch tracking
dywapitchtrack uses a wavelet-based algorithm
"Realtime C# Pitch Tracker" uses a modified autocorrelation approach now removed from Codeplex - try searching on GitHub
aubio (mentioned by piem; several algorithms are available)
There are also some pitch trackers out there which might not be designed for real-time, but may be usable that way for all I know, and could also be useful as a reference to compare your real-time tracker to:
Praat is an open source package sometimes used for pitch extraction by linguists and you can find the algorithm documented at http://www.fon.hum.uva.nl/paul/praat.html
Snack and WaveSurfer also contain a pitch extractor
I know this answer isn't going to make everyone happy but here goes.
This stuff is hard, very hard. First, go read as many tutorials as you can find on FFTs, autocorrelation, and wavelets. Although I'm still struggling with DSP, I did get some insights from the following.
https://www.coursera.org/course/audio the course isn't running at the moment but the videos are still available.
http://miracle.otago.ac.nz/tartini/papers/Philip_McLeod_PhD.pdf thesis about the development of a pitch recognition algorithm.
http://dsp.stackexchange.com a whole site dedicated to digital signal processing.
If, like me, you didn't do enough maths to completely follow the tutorials, don't give up, as some of the diagrams and examples still helped me understand what was going on.
Next is test data and testing. Write yourself a library that generates test files to use when checking your algorithm(s).
1) A super-simple pure sine wave generator. Say you are looking at writing YAT (Yet Another Tuner); use your sine generator to create a series of files around 440 Hz, say from 420-460 Hz in varying increments, and see how sensitive and accurate your code is. Can it resolve to within 5 Hz, 1 Hz, or finer still?
2) Then upgrade your sine wave generator so that it adds a series of weaker harmonics to the signal.
3) Next are real-world variations on harmonics. Whilst for most stringed instruments you'll see a series of harmonics as simple multiples of the fundamental frequency F0, for instruments like clarinets and flutes the even harmonics will be missing or very weak because of the way the air behaves in the chamber. For some instruments F0 itself is missing but can be determined from the distribution of the other harmonics; F0 is what the human ear perceives as pitch.
4) Throw in some deliberate distortion by shifting the harmonic peak frequencies up and down in an irregular manner.
The point being that if you are creating files with known results, it's easier to verify that what you are building actually works, bugs aside of course. A minimal generator along these lines is sketched below.
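As a hedged sketch of points 1-3 (assuming numpy and scipy; the amplitudes, duration and file names are arbitrary choices), a generator for test tones with a known fundamental and optional weaker harmonics, written out as WAV files:

    # Generate WAV test tones with a known fundamental and weaker harmonics.
    import numpy as np
    from scipy.io import wavfile

    def test_tone(f0, harmonics=(1.0, 0.5, 0.25), sr=44100, seconds=2.0):
        t = np.arange(int(sr * seconds)) / sr
        signal = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
                     for k, a in enumerate(harmonics))
        return (0.9 * signal / np.max(np.abs(signal))).astype(np.float32)

    # e.g. a sweep of files around concert A for checking tuner resolution
    for f0 in range(420, 461, 5):
        wavfile.write(f"tone_{f0}Hz.wav", 44100, test_tone(f0))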
There are also a number of "libraries" out there containing sound samples.
https://freesound.org from the Coursera series mentioned above.
http://theremin.music.uiowa.edu/MIS.html
Next, be aware that your microphone is not perfect, and unless you have spent thousands of dollars on it, it will have a fairly variable frequency response. In particular, if you are working with low notes, cheaper microphones (read: the ones built into your PC or phone) have significant roll-off starting at around 80-100 Hz. With reasonably good external ones you might get down to 30-40 Hz. Go find the data on your microphone.
You can also check what happens by playing the tone through speakers and then recording it with your favourite microphone. But of course now we are talking about two sets of frequency response curves.
When it comes to performance there are a number of freely available libraries out there although do be aware of the various licensing models.
Above all don't give up after your first couple of tries. Best of luck.
Here's the C++ source code for an unusual two-stage algorithm that I devised which can do real-time pitch detection on polyphonic MP3 files while they are being played on Windows. This free application (PitchScope Player, available on the web) is frequently used to detect the notes of a guitar or saxophone solo on an MP3 recording. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within an MP3 music file. Note onsets are accurately inferred by a significant change in the most dominant pitch at any given moment during the MP3 recording.
When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). Linked at the bottom, is a snapshot of the actual harmonics which occur during a polyphonic MP3 recording of a guitar solo.
Instead of an FFT, I use a modified DFT transform, with logarithmic frequency spacing, to first detect these possible harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a windowing function to the signal, nor do I need to add and overlap. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
Now being retired, I have decided to release the source code for my pitch detection engine within a free demonstration app called PitchScope Player. PitchScope Player is available on the web, and you can download the executable for Windows to see my algorithm at work on an mp3 file of your choosing. The link below to GitHub.com will lead you to my full source code, where you can view how I detect the harmonics with a custom Logarithmic DFT transform and then look for partials (harmonics) whose frequencies satisfy the correct integer relationship which defines a 'pitch'.
My Pitch Detection Algorithm is actually a two-stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two-Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.
I had a similar problem with microphone input on a project I did a few years back - turned out to be due to a DC offset.
Make sure you remove any bias before attempting FFT or whatever other method you are using.
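As a hedged one-liner of what that means in practice (in Python/numpy here purely for illustration), removing a DC offset is just subtracting the mean of the buffer before the FFT or autocorrelation:

    import numpy as np

    def remove_dc(frame):
        # Subtract the buffer's mean so the zero-frequency component
        # doesn't swamp the FFT or bias the autocorrelation.
        return frame - np.mean(frame)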
It is also possible that you are running into headroom or clipping problems.
Graphs are the best way to diagnose most problems with audio.
Take a look at this sample application:
http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I realize the app is in C# and you need C++, and I realize this is .Net/Windows and you're on a mac... But I figured his FFT implementation might be a starting reference point. Try to compare your FFT implementation to his. (His is the iterative, breadth-first version of Cooley-Tukey's FFT). Are they similar?
Also, the "random" behavior you're describing might be because you're grabbing data returned by your sound card directly without assembling the values from the byte-array properly. Did you ask your sound card to sample 16 bit values, and then gave it a byte-array to store the values in? If so, remember that two consecutive bytes in the returned array make up one 16-bit audio sample.
Java code for a real-time pitch detector is available at http://code.google.com/p/freqazoid/.
It works fairly well on any computer running post-2008 real-time Java. The project has been dropped and could be picked up by any interested party. Contact me if you want further details.
Check out aubio, an open source library which includes several state-of-the-art methods for pitch tracking.
I have asked a similar question here:
C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input
EDIT:
Performous contains a C++ module for realtime pitch detection
Also Yin Pitch-Tracking algorithm
You can do real-time pitch detection, including of a singer's voice, with TarsosDSP:
https://github.com/JorenSix/TarsosDSP
just in case anyone hasn't heard of it yet :-)
Can you adapt anything from instrument tuners? My delightfully compact guitar tuner is able to detect the pitch of the strings pretty well. I see this reference to a piano tuner which explains an algorithm to some extent.
Here are some open source libraries that implement pitch detection:
WORLD : speech analysis/synthesis toolkit. This is especially suitable if your source signal is voice.
aubio : audio feature extraction library. Implements many pitch detection algorithms.
Pitch detection : a collection of pitch detection algorithms implemented in C++.
dywapitchtrack : a high quality pitch detection algorithm.
YIN : another implementation of the YIN algorithm in a single C++ source file.
