Sound wave features correlated with speaker voice

Sound wave features correlated with speaker voice - audio

What are a few of the simplest features of a sound wave that are correlated with differences between voices of different speakers? Where can I find how to calculate these from data?
The project we're trying to solve takes 3 hours of audio as input, each moment of which has either one of two people talking or silence. We need to determine which of these alternatives each moment corresponds to using hmm.
The results don't have to be 100% accurate. An approximation will suffice.

Related

'Mono' FFT Visualization of a Stereo Analog Audio Source

I have created a really basic FFT visualizer using a Teensy microcontroller, a display panel, and a pair of headphone jacks. I used kosme's FFT library for Arduino: https://github.com/kosme/arduinoFFT
Analog audio flows into the headphone input and to a junction where the microcontroller samples it. That junction is also connected to an audio out jack so that audio can be passed to some speakers.
This is all fine and good, but currently I'm only sampling the left audio channel. Any time music is stereo separated, the visualization cannot account for any sound on the right channel. I want to rectify this but I'm not sure whether I should start with hardware or software.
Is there a circuit I should build to mix the left and right audio channels? I figure I could do something like so:
But I'm pretty sure that my schematic is misguided. I included bias voltage to try and DC couple the audio signal so that it will properly ride over the diodes. Making sure that the output matches the input is important to me though.
Or maybe should this best be approached in software? Should I instead just be sampling both channels separately and then doing some math to combine them?

Combining the stereo channels of one end of the fork without combining the other two is very difficult. Working in software is much easier.
If you take two sets of samples, you've doubled the amount of math that the microcontroller needs to do.
But if you take readings from both pins and divide them by two, you can add them together and have one set of samples which represents the 'mono' signal.
Keep in mind that human ears have an uneven response to sound volumes, so a 'medium' volume reading on both pins, summed and halved, will result in a 'lower-medium' value. It's better to divide by 1.5 or 1.75 if you can spare the cycles for more complicated division.

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

If I know the SoundFont that a MIDI to audio track has used, can I theoretically reverse the audio back into it's (most likely) MIDI components? If so, what would be one of the best approach to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.

For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your record sample rates are exactly matched (synchronized to 1ppm or better, have no additional effects, also both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross correlations of your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a-priori known font to create a time line of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak pick to find the most likely MIDI note onset times.
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, and etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as this works reliably only for close to pure sinewaves, which very few musical instruments or interesting fonts sound like (or are as boring as).

Changing duration of Guitar pluck in PCM data

Folks,
I am struggling with a simple concept related to the duration of play of PCM data. I would appreciate your feedback.
The application I am developing plays guitar notes from a music sheet.
I have implemented Jaffe-Smith Algorithm for guitar plucking.
https://ccrma.stanford.edu/~jos/Mohonk05/Extended_Karplus_Strong_EKS_Algorithm.html.
Let's say I compute samples for note A (440 Hz) for one second.
At the sample rate of 11025, I will be storing 11025 samples that can be send to the computer speakers as PCM audio.
For all the unique notes on the guitar, it takes quite some time to compute samples for all the notes. I am thinking I will pre-compute and save them as binary data and simply load them when the application is run.
So far so good.
Now, let's say I want to play a song (a list of various notes). Let's say the song needs to be played at 100 beats per minute. Let's say I have to play note A for one beat or 0.6 seconds (60/100).
Recalculating samples for 0.6 seconds may take quite some time.
Can I simply play (11025 * 0.6) samples? Will this create any side effect?
Is there a better way to achieve what I am trying to do?
Thank you in advance for your help.
Regards,
Peter

What you're basically trying to do is create a synthesized guitar, yes? I might suggest that you go with the sampler route instead.
By sample, I mean a small clip of audio (not a single sample in the sense of ADC or DAC).
Basically, you can flatten what you need into 4 parts:
Attack
Decay
Sustain
Release
These four parts work in that order, and are generally referred to as an ADSR envelope. The attack of the note is the initial sound. For a guitar, you are going to hear a pluck and the start of a pitch. The decay is going to be the sample of the string as it starts to fade away. The sustain is a sample repeated over and over again until you release the key. The release sample is what is played when you release the key. For a guitar, you might hear a sample of lightly putting fingers back on the string to stop their vibration.
Now, you could generate all of these samples in real-time, but will likely be very CPU intensive.
Regarding your question: "Can I simply play (11025 * 0.6) samples?" Yes, at a sample rate of 11025, that will be 0.6 seconds of audio. Also keep in mind though that you should be sending a continuous stream of data to the sound card, filling any empty spots with 0 (for signed PCM).

Extracting pitch from singing voice

I'd like to extract the pitch from a singing voice. The track in question contains only a single voice and no other sounds.
I want to know the loudness and perceived pitch frequency at a given point in time. So something like the following:
0.0sec 400Hz -20dB
0.1sec 401Hz -9dB
0.2sec 403Hz -10dB
0.3sec 403Hz -10dB
0.4sec 404Hz -11dB
0.5sec 406Hz -13dB
0.6sec 410Hz -15dB
0.7sec 411Hz -16dB
0.8sec 409Hz -20dB
0.9sec 407Hz -24dB
1.0sec 402Hz -34dB
How might I achieve such an output? I'm interested in slight changes in frequency as apposed to a specific note value. I have some DSP knowledge and I can program in C++ and python but I'd like to avoid reinventing the wheel if possible.

Note that slight changes in frequency in Hz and perceived pitch may not be the same thing. Perceived pitch resolution seems to vary with absolute frequency, duration, and loudness. If you want more accuracy than this, there might be some research papers on estimating the time between each glottal closure (probably using a deconvolution or pattern matching technique), which would give you some sort of pitch period. The simplest pitch estimate might be some form of weighted autocorrelation, for which lots of canned algorithms and code is available.
Since dB is log scale, this measure might be somewhat closer to perceived loudness, but has to be spectrally weighted with some perceptual frequency response curve over some duration of measurement.
There seem to be research papers on both of these topics, as well as many textbooks on human audio perception as well as on common audio DSP techniques.

I suggest you read this article
http://audition.ens.fr/adc/pdf/2002_JASA_YIN.pdf
. This is one of the simplest methods of pitch detection, and it works very well.
Also, for measuring the instantaneous power of the signal, you can just take the absolute value of the signal and divide by 1/√2 (Gives the RMS value) and then smooth it (usually a first order low pass filter). I hope this helps. Good luck!

Real time pitch detection

I'm trying to do real time pitch detection of a users singing, but I'm running into alot of problems. I've tried lots of methods, including FFT (FFT Problem (Returns random results)) and autocorrelation (Autocorrelation pitch detection returns random results with mic input), but I can't seem to get any methods to give a good result. Can anyone suggest a method for real-time pitch tracking or how to improve on a method I already have? I can't seem to find any good C / C++ methods for real time pitch detection.
Thanks,
Niall.
Edit: Just to note, i've checked that the mic input data is correct, and that when using a sine wave the results are more or less the correct pitch.
Edit: Sorry this is late, but at the moment, im visualizing the autocolleration by taking the values out of the results array, and each index, and plotting the index on the X axis and the value on the Y axis (both are divided by 100000 or something, and im using OpenGL), plugging the data into a VST host and using VST plugins isn't an option to me. At the moment, it just looks like some random dots. Am i doing it correctly, or can you please point me torwards some code for doing it or help me understand how to visualize the raw audio data and autocorrelation data.

Taking a step back... To get this working you MUST figure out a way to plot intermediate steps of this process. What you're trying to do is not particularly hard, but it is error prone and fiddly. Clipping, windowing, bad wiring, aliasing, DC offsets, reading the wrong channels, the weird FFT frequency axis, impedance mismatches, frame size errors... who knows. But if you can plot the raw data, and then plot the FFT, all will become clear.

I found several open source implementations of real-time pitch tracking
dywapitchtrack uses a wavelet-based algorithm
"Realtime C# Pitch Tracker" uses a modified autocorrelation approach now removed from Codeplex - try searching on GitHub
aubio (mentioned by piem; several algorithms are available)
There are also some pitch trackers out there which might not be designed for real-time, but may be usable that way for all I know, and could also be useful as a reference to compare your real-time tracker to:
Praat is an open source package sometimes used for pitch extraction by linguists and you can find the algorithm documented at http://www.fon.hum.uva.nl/paul/praat.html
Snack and WaveSurfer also contain a pitch extractor

I know this answer isn't going to make everyone happy but here goes.
This stuff is hard, very hard. Firstly go read as many tutorials as you can find on FFT, Autocorrelation, Wavelets. Although I'm still struggling with DSP I did get some insights from the following.
https://www.coursera.org/course/audio the course isn't running at the moment but the videos are still available.
http://miracle.otago.ac.nz/tartini/papers/Philip_McLeod_PhD.pdf thesis about the development of a pitch recognition algorithm.
http://dsp.stackexchange.com a whole site dedicated to digital signal processing.
If like me you didn't do enough maths to completely follow the tutorials don't give up as some of the diagrams and examples still helped me to understand what was going on.
Next is test data and testing. Write yourself a library that generates test files to use in checking your algorithm/s.
1) A super simple pure sine wave generator. So say you are looking at writing YAT(Yet Another Tuner) then use your sine generator to create a series of files around 440Hz say from 420-460Hz in varying increments and see how sensitive and accurate your code is. Can it resolve to within 5Hz, 1Hz, finer still?
2) Then upgrade your sine wave generator so that it adds a series of weaker harmonics to the signal.
3) Next are real world variations on harmonics. So whilst for most stringed instruments you'll see a series of harmonics as simple multiples of the fundamental frequency F0, for instruments like clarinets and flutes because of the way the air behaves in the chamber the even harmonics will be missing or very weak. And for some instruments F0 is missing but can be determined from the distribution of the other harmonics. F0 being what the human ear perceives as pitch.
4) Throw in some deliberate distortion by shifting the harmonic peak frequencies up and down in an irregular manner
The point being that if you are creating files with known results then its easier to verify that what you are building actually works, bugs aside of course.
There are also a number of "libraries" out there containing sound samples.
https://freesound.org from the Coursera series mentioned above.
http://theremin.music.uiowa.edu/MIS.html
Next be aware that your microphone is not perfect and unless you have spent thousands of dollars on it will have a fairly variable frequency response range. In particular if you are working with low notes then cheaper microphones, read the inbuilt ones in your PC or Phone, have significant rolloff starting at around 80-100Hz. For reasonably good external ones you might get down to 30-40Hz. Go find the data on your microphone.
You can also check what happens by playing the tone through speakers and then recording with you favourite microphone. But of course now we are talking about 2 sets of frequency response curves.
When it comes to performance there are a number of freely available libraries out there although do be aware of the various licensing models.
Above all don't give up after your first couple of tries. Best of luck.

Here's the C++ source code for an unusual two-stage algorithm that I devised which can do Realtime Pitch Detection on polyphonic MP3 files while being played on Windows. This free application (PitchScope Player, available on web) is frequently used to detect the notes of a guitar or saxophone solo upon a MP3 recording. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a MP3 music file. Note onsets are accurately inferred by a significant change in the most dominant pitch (a musical note) at any given moment during the MP3 recording.
When a single key is pressed upon a piano, what we hear is not just one frequency of sound vibration, but a composite of multiple sound vibrations occurring at different mathematically related frequencies. The elements of this composite of vibrations at differing frequencies are referred to as harmonics or partials. For instance, if we press the Middle C key on the piano, the individual frequencies of the composite's harmonics will start at 261.6 Hz as the fundamental frequency, 523 Hz would be the 2nd Harmonic, 785 Hz would be the 3rd Harmonic, 1046 Hz would be the 4th Harmonic, etc. The later harmonics are integer multiples of the fundamental frequency, 261.6 Hz ( ex: 2 x 261.6 = 523, 3 x 261.6 = 785, 4 x 261.6 = 1046 ). Linked at the bottom, is a snapshot of the actual harmonics which occur during a polyphonic MP3 recording of a guitar solo.
Instead of a FFT, I use a modified DFT transform, with logarithmic frequency spacing, to first detect these possible harmonics by looking for frequencies with peak levels (see diagram below). Because of the way that I gather data for my modified Log DFT, I do NOT have to apply a Windowing Function to the signal, nor do add and overlap. And I have created the DFT so its frequency channels are logarithmically located in order to directly align with the frequencies where harmonics are created by the notes on a guitar, saxophone, etc.
Now being retired, I have decided to release the source code for my pitch detection engine within a free demonstration app called PitchScope Player. PitchScope Player is available on the web, and you could download the executable for Windows to see my algorithm at work on a mp3 file of your choosing. The below link to GitHub.com will lead you to my full source code where you can view how I detect the harmonics with a custom Logarithmic DFT transform, and then look for partials (harmonics) whose frequencies satisfy the correct integer relationship which defines a 'pitch'.
My Pitch Detection Algorithm is actually a two-stage process: a) First the ScalePitch is detected ('ScalePitch' has 12 possible pitch values: {E, F, F#, G, G#, A, A#, B, C, C#, D, D#} ) b) and after ScalePitch is determined, then the Octave is calculated by examining all the harmonics for the 4 possible Octave-Candidate notes. The algorithm is designed to detect the most dominant pitch (a musical note) at any given moment in time within a polyphonic MP3 file. That usually corresponds to the notes of an instrumental solo. Those interested in the C++ source code for my Two-Stage Pitch Detection algorithm might want to start at the Estimate_ScalePitch() function within the SPitchCalc.cpp file at GitHub.com.
https://github.com/CreativeDetectors/PitchScope_Player
Below is the image of a Logarithmic DFT (created by my C++ software) for 3 seconds of a guitar solo on a polyphonic mp3 recording. It shows how the harmonics appear for individual notes on a guitar, while playing a solo. For each note on this Logarithmic DFT we can see its multiple harmonics extending vertically, because each harmonic will have the same time-width. After the Octave of the note is determined, then we know the frequency of the Fundamental.

I had a similar problem with microphone input on a project I did a few years back - turned out to be due to a DC offset.
Make sure you remove any bias before attempting FFT or whatever other method you are using.
It is also possible that you are running into headroom or clipping problems.
Graphs are the best way to diagnose most problems with audio.

Take a look at this sample application:
http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx
I realize the app is in C# and you need C++, and I realize this is .Net/Windows and you're on a mac... But I figured his FFT implementation might be a starting reference point. Try to compare your FFT implementation to his. (His is the iterative, breadth-first version of Cooley-Tukey's FFT). Are they similar?
Also, the "random" behavior you're describing might be because you're grabbing data returned by your sound card directly without assembling the values from the byte-array properly. Did you ask your sound card to sample 16 bit values, and then gave it a byte-array to store the values in? If so, remember that two consecutive bytes in the returned array make up one 16-bit audio sample.

Java code for a real-time real detector is available at http://code.google.com/p/freqazoid/.
It works fairly well on any computer running post-2008 real-time Java. The project has been dropped and could be picked up by any interested party. Contact me if you want further details.

Check out aubio, and open source library which includes several state-of-the-art methods for pitch tracking.

I have asked a similar question here:
C/C++/Obj-C Real-time algorithm to ascertain Note (not Pitch) from Vocal Input
EDIT:
Performous contains a C++ module for realtime pitch detection
Also Yin Pitch-Tracking algorithm

You could do real time pitch detection, be it of a singer's voice, with TarsosDSP
https://github.com/JorenSix/TarsosDSP
just in case anyone hasn't heard of it yet :-)

Can you adapt anything from instrument tuners? My delightfully compact guitar tuner is able to detect the pitch of the strings pretty well. I see this reference to a piano tuner which explains an algorithm to some extent.

Here are some open source libraries that implement pitch detection:
WORLD : speech analysis/synthesis toolkit. This is especially suitable if your source signal is voice.
aubio : audio feature extraction library. Implements many pitch detection algorithms.
Pitch detection : a collection of pitch detection algorithms implemented in C++.
dywapitchtrack : a high quality pitch detection algorithm.
YIN : another implementation of the YIN algorithm in a single C++ source file.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string