Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
I am born with Single-Sided Death which, in short, means that you're death on one of your ears. Only being able to hear on one ear, makes you lose the ability to hear direction and distance of sound.
I have wondered whether it is possible to determine direction of a sound given a stereo stream of sound. For simplification, just for two channels (left and right) and with the assumption that there does not exist background noise.
What would be a good initial strategy to cope with this problem?
Thanks in advance.
This seems like a interesting problem.
Also since this is a hypothetical scenario, the solution/ idea given below is also probable and will depend upon lots of factors.
Considering we have a stereo audio , using pydub we can split that into two mono channels using:
AudioSegment(…).split_to_mono()
Splits a stereo AudioSegment into two, one for each channel (Left/Right). Returns a list with the new AudioSegment objects with the left channel at index 0 and the right channel at index 1.
Then we can figure out which channel is loudest using
AudioSegment(…).split_to_mono()
Splits a stereo AudioSegment into two, one for each channel (Left/Right). Returns a list with the new AudioSegment objects with the left channel at index 0 and the right channel at index 1.
Then we measure loudness in each channel using:
AudioSegment(…).rms
A measure of loudness. Used to compute dBFS, which is what you should use in most cases. Loudness is logarithmic (rms is not), which makes dB a much more natural scale.
So for test, I used a stereo music wave file and split to two mono channels and checked its loudness to see which channel is loudest.
from pydub import AudioSegment
sound = AudioSegment.from_file("audio.wav")
split_sound = sound.split_to_mono()
left_loudness = split_sound[0].rms
right_loudness = split_sound[1].rms
Output
>>> left_loudness
7030
>>> right_loudness
6993
>>>
Related
I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.
Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.
I have created a really basic FFT visualizer using a Teensy microcontroller, a display panel, and a pair of headphone jacks. I used kosme's FFT library for Arduino: https://github.com/kosme/arduinoFFT
Analog audio flows into the headphone input and to a junction where the microcontroller samples it. That junction is also connected to an audio out jack so that audio can be passed to some speakers.
This is all fine and good, but currently I'm only sampling the left audio channel. Any time music is stereo separated, the visualization cannot account for any sound on the right channel. I want to rectify this but I'm not sure whether I should start with hardware or software.
Is there a circuit I should build to mix the left and right audio channels? I figure I could do something like so:
But I'm pretty sure that my schematic is misguided. I included bias voltage to try and DC couple the audio signal so that it will properly ride over the diodes. Making sure that the output matches the input is important to me though.
Or maybe should this best be approached in software? Should I instead just be sampling both channels separately and then doing some math to combine them?
Combining the stereo channels of one end of the fork without combining the other two is very difficult. Working in software is much easier.
If you take two sets of samples, you've doubled the amount of math that the microcontroller needs to do.
But if you take readings from both pins and divide them by two, you can add them together and have one set of samples which represents the 'mono' signal.
Keep in mind that human ears have an uneven response to sound volumes, so a 'medium' volume reading on both pins, summed and halved, will result in a 'lower-medium' value. It's better to divide by 1.5 or 1.75 if you can spare the cycles for more complicated division.
What are a few of the simplest features of a sound wave that are correlated with differences between voices of different speakers? Where can I find how to calculate these from data?
The project we're trying to solve takes 3 hours of audio as input, each moment of which has either one of two people talking or silence. We need to determine which of these alternatives each moment corresponds to using hmm.
The results don't have to be 100% accurate. An approximation will suffice.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Improve this question
How could I differentiate between two people speaking? As in if someone says "hello" and then another person says "hello" what kind of signature should I be looking for in the audio data? periodicity?
Thanks a lot to anyone who can answer this!
The solution to this problem lies in Digital Signal Processing (DSP). Speaker recognition is a complex problem which brings computers and communication engineering to work hand in hand. Most techniques of speaker identification require signal processing with machine learning (training over the speaker database and then identification using training data). The outline of algorithm which may be followed -
Record the audio in raw format. This serves as the digital signal which needs to be processed.
Apply some pre-processing routines over the captured signal. These routines could be simply signal normalization, or filtering the signal to remove noise (using band pass filters for normal frequency range of human voice. Band pass filters can in turn be created using a low pass and a high pass filter in combination.)
Once it is fairly certain that the captured signal is pretty much free from noise, feature extraction phase begins. Some of the known techniques which are used for extracting voice features are - Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding (LPC) or simple FFT features.
Now, there are two phases - training and testing.
First the system needs to be trained over the voice features of different speakers before it is capable to distinguish between them. In order to ensure that the features are correctly calculated, it is recommended that several (>10) samples of voice from speakers must be collected for training purposes.
Training can be done using different techniques like neural networks or distance based classification to find the differences in the features of voices from different speakers.
In testing phase, the training data is used to find the voice feature set which lies at the lowest distance from the signal being tested. Different distances like Euclidean or Chebyshev distances might be used to calculate this proximity.
There are two open source implementations which enable speaker identification - ALIZE: http://mistral.univ-avignon.fr/index_en.html and MARF: http://marf.sourceforge.net/.
I know its a bit late to answer this question, but I hope someone finds it useful.
This is an extremely hard problem, even for experts in speech and signal processing. This page has much more information: http://en.wikipedia.org/wiki/Speaker_recognition
And some suggested technology starting points:
The various technologies used to
process and store voice prints include
frequency estimation, hidden Markov
models, Gaussian mixture models,
pattern matching algorithms, neural
networks, matrix representation,Vector
Quantization and decision trees. Some
systems also use "anti-speaker"
techniques, such as cohort models, and
world models.
Having only two people to differentiate, if they are uttering the same word or phrase will make this much easier. I suggest starting with something simple, and only adding complexity as needed.
To begin, I'd try sample counts of the digital waveform, binned by time and magnitude or (if you have the software functionality handy) an FFT of the entire utterance. I'd consider a basic modeling process first, too, such as linear discriminant (or whatever you already have available).
Another way to go is to use an array of microphones and differentiate between the postions and directions of the vocal sources. I consider this to be a easier approach since the position calculation is much less complicated than separating different speakers from a mono or stereo source.
Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have a sample held in a buffer from DirectX. It's a sample of a note played and captured from an instrument. How do I analyse the frequency of the sample (like a guitar tuner does)? I believe FFTs are involved, but I have no pointers to HOWTOs.
The FFT can help you figure out where the frequency is, but it can't tell you exactly what the frequency is. Each point in the FFT is a "bin" of frequencies, so if there's a peak in your FFT, all you know is that the frequency you want is somewhere within that bin, or range of frequencies.
If you want it really accurate, you need a long FFT with a high resolution and lots of bins (= lots of memory and lots of computation). You can also guess the true peak from a low-resolution FFT using quadratic interpolation on the log-scaled spectrum, which works surprisingly well.
If computational cost is most important, you can try to get the signal into a form in which you can count zero crossings, and then the more you count, the more accurate your measurement.
None of these will work if the fundamental is missing, though. :)
I've outlined a few different algorithms here, and the interpolated FFT is usually the most accurate (though this only works when the fundamental is the strongest harmonic - otherwise you need to be smarter about finding it), with zero-crossings a close second (though this only works for waveforms with one crossing per cycle). Neither of these conditions is typical.
Keep in mind that the partials above the fundamental frequency are not perfect harmonics in many instruments, like piano or guitar. Each partial is actually a little bit out of tune, or inharmonic. So the higher-frequency peaks in the FFT will not be exactly on the integer multiples of the fundamental, and the wave shape will change slightly from one cycle to the next, which throws off autocorrelation.
To get a really accurate frequency reading, I'd say to use the autocorrelation to guess the fundamental, then find the true peak using quadratic interpolation. (You can do the autocorrelation in the frequency domain to save CPU cycles.) There are a lot of gotchas, and the right method to use really depends on your application.
There are also other algorithms that are time-based, not frequency based.
Autocorrelation is a relatively simple algorithm for pitch detection.
Reference: http://cnx.org/content/m11714/latest/
I have written c# implementations of autocorrelation and other algorithms that are readable. Check out http://code.google.com/p/yaalp/.
http://code.google.com/p/yaalp/source/browse/#svn/trunk/csaudio/WaveAudio/WaveAudio
Lists the files, and PitchDetection.cs is the one you want.
(The project is GPL; so understand the terms if you use the code).
Guitar tuners don't use FFT's or DFT's. Usually they just count zero crossings. You might not get the fundamental frequency because some waveforms have more zero crossings than others but you can usually get a multiple of the fundamental frequency that way. That's enough to get the note although you might be one or more octaves off.
Low pass filtering before counting zero crossings can usually get rid of the excess zero crossings. Tuning the low pass filter requires some knowlegde of the range of frequency you want to detect though
FFTs (Fast-Fourier Transforms) would indeed be involved. FFTs allow you to approximate any analog signal with a sum of simple sine waves of fixed frequencies and varying amplitudes. What you'll essentially be doing is taking a sample and decomposing it into amplitude->frequency pairs, and then taking the frequency that corresponds to the highest amplitude.
Hopefully another SO reader can fill the gaps I'm leaving between the theory and the code!
A little more specifically:
If you start with the raw PCM in an input array, what you basically have is a graph of wave amplitude vs time.Doing a FFT will transform that to a frequency histogram for frequencies from 0 to 1/2 the input sampling rate. The value of each entry in the result array will be the 'strength' of the corresponding sub-frequency.
So to find the root frequency given an input array of size N sampled at S samples/second:
FFT(N, input, output);
max = max_i = 0;
for(i=0;i<N;i++)
if (output[i]>max) max_i = i;
root = S/2.0 * max_i/N ;
Retrieval of fundamental frequencies in a PCM audio signal is a difficult task, and there would be a lot to talk about it...
Anyway, usually time-based method are not suitable for polyphonic signals, because a complex wave given by the sum of different harmonic components due to multiple fundamental frequencies has a zero-crossing rate which depends only from the lowest frequency component...
Also in the frequency domain the FFT is not the most suitable method, since frequency spacing between notes follow an exponential scale, not linear. This means that a constant frequency resolution, used in the FFT method, may be insufficient to resolve lower frequency notes if the size of the analysis window in the time domain is not large enough.
A more suitable method would be a constant-Q transform, which is DFT applied after a process of low-pass filtering and decimation by 2 (i.e. halving each step the sampling frequency) of the signal, in order to obtain different subbands with different frequency resolution. In this way the calculation of DFT is optimized. The trouble is that also time resolution is variable, and increases for the lower subbands...
Finally, if we are trying to estimate the fundamental frequency of a single note, FFT/DFT methods are ok. Things change for a polyphonic context, in which partials of different sounds overlap and sum/cancel their amplitude depending from their phase difference, and so a single spectral peak could belong to different harmonic contents (belonging to different notes). Correlation in this case don't give good results...
Apply a DFT and then derive the fundamental frequency from the results. Googling around for DFT information will give you the information you need -- I'd link you to some, but they differ greatly in expectations of math knowledge.
Good luck.