How can I look for certain sounds in a live sound input? - audio

I've combed StackOverflow and the web for many questions on whistle detection, etc, and many people did explain as much as they could as to how they can go about detecting their stuff.
capturing sound for analysis and visualizing frequences in android
analyzing whistle sound for pitch note
But what I don't get is how does FFT help you to detect certain sounds in a given sample audio data?
Here's what I understand so far from some stuff I found here and there.
-The sine wave is more or less the building block of ALL signals, musical or not
-Three parameters - FREQUENCY, AMPLITUDE, and INITIAL PHASE, characterize every steady sine wave completely.
-They make each and any kind of wave unique.
-Fourier transform can be used to inspect what kinds of sine waves there are in a signal
SOURCE -- [Audio signal processing basics][3]
Audio data that the computer generates as received from the mic or other input source, for live processing, is an array of amplitudes processed (or stored or taken) at a particular sample rate.
So how does one go from that to detecting whistles and claps?
And complex things such as say, a short period of whistling to a particular song?
My theory of detecting is that we test our whistles in a spectrogram, and record the particular frequency and amplitude characteristics. And then if those particular characteristics are repeated again in the input, we've detected a whistle.
Am I right or wrong?
This sound processing stuff is a little complicated.
Forgot to mention this - I'm using Python. Java is also okay, since most of the example code I found was for Android, which is in Java. And I can work in Java too. Any mention of any libraries or APIs would be helpful too.
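To make the "array of amplitudes to detected whistle" step concrete, here is a minimal sketch in Python with NumPy. It assumes a whistle shows up as a single narrow peak in the spectrum of a short frame. The function names, the 500-5000 Hz band and the 0.5 energy share are illustrative assumptions, not values from any reference; you would tune them against spectrograms of your own whistles.

```python
import numpy as np

def dominant_frequency(frame, sample_rate):
    """Return (frequency in Hz, energy share) of the strongest FFT bin."""
    windowed = frame * np.hanning(len(frame))       # taper to reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2      # power per frequency bin
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    peak = int(np.argmax(power))
    # Energy in the peak bin and its two neighbours, relative to the total.
    lo, hi = max(peak - 1, 0), min(peak + 2, len(power))
    share = power[lo:hi].sum() / (power.sum() + 1e-12)
    return freqs[peak], share

def looks_like_whistle(frame, sample_rate, band=(500.0, 5000.0), min_share=0.5):
    """Hypothetical rule: one narrow peak inside `band` holding most of the energy."""
    freq, share = dominant_frequency(frame, sample_rate)
    return bool(band[0] <= freq <= band[1] and share >= min_share)
```

You would call `looks_like_whistle` on successive frames (say 1024 samples each) from the microphone. A clap is broadband, so it would need a different rule, for example a sudden jump in total energy.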

Related

Fieldwork audio recording for acoustic analysis: stereo or mono? appropriate gain?

I work in the field of phonetics and often need to record human speech for acoustic analysis. I have two questions that I couldn't find answers to:
If I record in stereo channels, I need to convert to mono later on to proceed with annotation. So in principle mono signal is good enough. Are there reasons that stereo sound should be used (e.g. the signal would be better?)
Also, we were warned that the gain level should be kept small so that the recording level doesn't exceed the maximum, which leads to signal cutoff (clipping). However, I was also criticised when the recording file shows too low an amplitude (it's still very clear though), because that leads to a low SNR. How do people choose an appropriate gain level?
As the act of recording is involved, the Sound Design forum might be your best bet.
I can't think of anything that might be gained, in terms of frequency analysis, by having a stereo signal. Stereo is more about locating the source of a sound in 3D space. Does the source of sound emit different frequency profiles in different directions? Does the environment filter the sound differently over the course of the two paths to the stereo inputs? If the answer is "not significantly" then mono should be fine.
Choosing an appropriate gain level is mostly a matter of knowing your equipment. Ideally, your recording setup will provide feedback (usually a visual meter of some sort) that shows the signal strength. The "best" would be (theoretically) the loudest level that does not distort. So you have to know at what level distortion happens on all the elements of the recording chain.
There can be some fudging on this, given that the loudest peak on a recorded segment may be an outlier.
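If your recorder only gives you a file and no meter, you can check levels after a test take. The sketch below is an assumption-laden helper, not a standard tool: it expects samples already scaled to floats in [-1.0, 1.0], and `level_report` and the 99.9th-percentile "typical peak" (a way to ignore an outlier peak) are names and choices made up for illustration.

```python
import numpy as np

def level_report(samples, clip_level=1.0):
    """Summarise the level of a recording given as floats in [-1.0, 1.0]."""
    samples = np.asarray(samples, dtype=float)
    magnitude = np.abs(samples)
    rms = np.sqrt(np.mean(samples ** 2))

    def to_db(x):
        return 20 * np.log10(x + 1e-12)   # small offset avoids log(0)

    return {
        "peak_dbfs": to_db(magnitude.max()),                         # loudest sample
        "typical_peak_dbfs": to_db(np.percentile(magnitude, 99.9)),  # ignores rare outliers
        "rms_dbfs": to_db(rms),                                      # average level
        "clipped_samples": int(np.sum(magnitude >= clip_level)),     # samples at full scale
    }
```

A nonzero `clipped_samples` means the gain was too high. A peak far below 0 dBFS means there is headroom you could use to improve the SNR.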

Methods for simulating moving audio source

I'm currently researching a problem regarding DOA (direction of arrival) regression for an audio source, and need to generate training data in the form of audio signals of moving sound sources. In particular, I have the stationary sound files, and I need to simulate a source and microphone(s) with the distances between them changing to reflect movement.
Is there any software online that could potentially do the trick? I've looked into pyroomacoustics and VA as well as other potential libraries, but none of them seem to deal with moving audio sources, due to the difficulties in simulating the Doppler effect.
If I were to write up my own simulation code for dealing with this, how difficult would it be? My use case would be an audio source and a microphone in some 2D landscape, both moving with their own velocities, where I would want to collect the recording from the microphone as an audio file.
Some speculation here on my part, as I have only dabbled with writing some aspects of what you are asking about and am not experienced with any particular libraries. Likelihood is good that something exists and will turn up.
That said, I wonder if it would be possible to use either the Unreal or Unity game engine. Both, as far as I can remember, grant the ability to load your own cues and support 3D including Doppler.
As far as writing your own, a lot depends on what you already know. With a single-point mike (as opposed to stereo) the pitch shifting involved is not that hard. There is a technique that involves stepping through the audio file's DSP data using linear interpolation for steps that lie in between the data points, which is considered to have sufficient fidelity for most purposes. Lots of trig, too, to track the changes in velocity.
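The linear-interpolation idea can be sketched roughly as follows in Python with NumPy, for one source and one mono microphone. Assumptions to note: the caller supplies the source-to-mic distance for every output sample (that is where the trig goes), the delay is computed from the distance at reception time (a simplification that is fine when speeds are far below the speed of sound), and there is free-field 1/r attenuation with no reflections. `simulate_moving_source` is a made-up name.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def simulate_moving_source(signal, sample_rate, distance_m):
    """Render `signal` as heard at a mic whose distance to the source at each
    output sample is `distance_m` (array, same length as `signal`)."""
    distance_m = np.asarray(distance_m, dtype=float)
    n = np.arange(len(signal))
    # The sound heard at output sample n left the source delay[n] samples earlier.
    delay = distance_m / SPEED_OF_SOUND * sample_rate
    read_pos = n - delay                 # fractional position in the source file
    # np.interp interpolates linearly between neighbouring samples;
    # positions outside the file read as silence.
    out = np.interp(read_pos, n, signal, left=0.0, right=0.0)
    # Simple 1/r spreading loss, clamped so very small distances do not blow up.
    return out / np.maximum(distance_m, 1.0)
```

Because the read position advances faster than one sample per output sample while the distance shrinks, the pitch rises by roughly a factor of (1 + v/c) on approach and falls on retreat, which is the Doppler shift.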
If we are dealing with stereo, though, it does get more complicated, depending on how far you want to go with it. The head masks high frequencies, so real time filtering would be needed. Also it would be good to implement delay to match the different arrival times at each ear. And if you start talking about pinnas, I'm way out of my league.
As of now it seems like Pyroomacoustics does not support moving sound sources. However, do check a possible workaround suggested by the developers here in Issue #105 - where the idea of using a time-varying convolution on a dense microphone array is suggested.

What algorithm is used by Sox (Swiss Army Knife) for Silence and Noise removal

I have tried Sox for removing silence and noise from an audio file. I would like to know the technical details of how it works. It is important to understand this before professional software can rely on it (I know it works great and has been used by many).
When noise is sampled using Noise Profile and then removed using Noisered, what is Sox actually doing in this process? Similarly, what happens when the VAD effect is added? Is there a technical explanation or a published paper that I can read to understand it?
I have a background in signal processing due to my studies (scientific basics of speech and music, communication sciences) and just had a look into the code of the noise reduction algorithm of sox.
Without analyzing it too deeply, it seems to compute an FFT of the noise profile and of the original signal, subtract the former from the latter, and then perform an FFT synthesis (inverse FFT) to re-create a signal similar to the original.
By this process it should reduce all the frequencies by the amount they appear in the noise signal.
The whole process seems to be done window-by-window which should allow streaming.
As I said, this is just based on my background knowledge and the short glance I took at the code, so there might be aspects which I didn't grasp.
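To illustrate the general idea described above (not Sox's actual code, which I have not reproduced), here is a minimal spectral-subtraction sketch in Python with NumPy. The frame size, the 50 % overlap, the Hann window and the function name `spectral_subtract` are all my own assumptions; `noise` must be at least one frame long.

```python
import numpy as np

def spectral_subtract(signal, noise, frame=512):
    """Subtract the average noise magnitude spectrum from each frame of `signal`."""
    hop = frame // 2
    window = np.hanning(frame + 1)[:-1]   # periodic Hann: 50 % overlaps sum to 1
    # Noise profile: mean magnitude per bin over all frames of the noise sample.
    noise_frames = [noise[i:i + frame] * window
                    for i in range(0, len(noise) - frame + 1, hop)]
    profile = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(signal))
    for i in range(0, len(signal) - frame + 1, hop):
        spec = np.fft.rfft(signal[i:i + frame] * window)
        mag = np.maximum(np.abs(spec) - profile, 0.0)       # never go below zero
        cleaned = mag * np.exp(1j * np.angle(spec))         # keep the original phase
        out[i:i + frame] += np.fft.irfft(cleaned, n=frame)  # overlap-add resynthesis
    return out
```

Each frame only needs the current window of samples plus the stored profile, which is why this kind of processing can run on a stream.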
EDIT:
I also had a glance at the VAD code; that one seems to monitor the spectrum for frequencies appearing in the specified range and, if it finds them, declares the window "voice". All parts (windows) not declared "voice" are then silenced (AFAICS). Effectively this should remove all background noise in a pure-voice recording.
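Again as an illustration of the principle only, and not of what Sox's VAD effect really does internally, a band-energy gate could look like this in Python with NumPy. The 300-3400 Hz band, the 0.6 ratio and the name `band_gate` are assumptions picked for the example.

```python
import numpy as np

def band_gate(signal, sample_rate, band=(300.0, 3400.0), frame=512, min_ratio=0.6):
    """Silence every frame whose energy is not concentrated inside `band`."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros_like(signal)
    freqs = np.fft.rfftfreq(frame, d=1.0 / sample_rate)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    flags = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        power = np.abs(np.fft.rfft(chunk * np.hanning(frame))) ** 2
        # "Voice" here just means most of the frame's energy lies in the band.
        is_voice = power[in_band].sum() >= min_ratio * (power.sum() + 1e-12)
        flags.append(bool(is_voice))
        if is_voice:
            out[start:start + frame] = chunk   # keep voice frames untouched
    return out, flags
```

A real detector would also smooth the decision over several frames so that short gaps inside a word are not cut out.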

How can I detect the sound in a raw sound file

I am developing software that automatically records my voice and extracts every word. I used the portaudio library for this. But I am stuck on detecting the sound: I treat a sample value of zero as silence, so any sample that is zero must be the start or end point of a sound. But when I ran it, the program created many words. I think this is because the values I read from portaudio are raw data, so they can't be processed like that. Am I right? How can I fix it? By the way, I am coding in C++ :D
To detect the presence of a signal in a PCM stream you need to be able to tell it apart from the background noise. As dprogramz said, the noise floor of your soundcard is probably not perfect and so there will be some noise signal recorded (even with no mic connected).
The solution is to use a VOX or VAD algorithm to detect the presence of your voice. VOX can be tricky, since in most consumer grade electronics the noise floor is just low enough to be "silence" to the human ear, relative to the signal. This means that the difference in amplitude between the noise floor and signal may be slight. If your sound card has AGC turned on this can make it even more difficult, since the noise floor may move. Having said that, VOX can be implemented successfully on consumer grade equipment. It just takes more effort to establish the threshold. When done best the threshold is calculated periodically while the stream is active.
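A threshold that is recalculated while the stream is active could be sketched like this (Python with NumPy for brevity; the same logic ports directly to C++). The frame size, the 10 dB margin, the adaptation rate and the name `vox_frames` are assumptions, and the sketch also assumes the stream starts with background noise only.

```python
import numpy as np

def vox_frames(signal, frame=256, margin_db=10.0, adapt=0.05):
    """Flag frames whose level exceeds a slowly tracked noise floor by `margin_db`."""
    signal = np.asarray(signal, dtype=float)
    floor_db = None
    flags = []
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        level_db = 10 * np.log10(np.mean(chunk ** 2) + 1e-12)
        if floor_db is None:
            floor_db = level_db          # assume the first frame contains no speech
        active = level_db > floor_db + margin_db
        if not active:
            # Only quiet frames update the floor, so speech does not drag it up.
            floor_db += adapt * (level_db - floor_db)
        flags.append(bool(active))
    return flags
```

Because the floor keeps following quiet frames, slow drifts in the noise level, such as those caused by AGC, move the threshold with them.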
If I were doing this I'd implement a VAD algorithm. Since your objective is to detect your voice this should provide a reliable result regardless of the equipment you use.
I don't think it's because it is a RAW value. RAW sound files are just a stream of sample values (amplitudes) with no header.
However, the value will rarely (if ever) be zero. You have to take into account there is a small amount of electrical noise that is made by the mic. Figure out the "idle" dB of your mic (just test the level when you aren't talking into it). You then need to set a silence threshold (below a certain dB level for a certain number of samples) to detect the beginning/end. Attempting to detect a zero value is gonna be near impossible.
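The "below a certain dB level for a certain number of samples" rule could look roughly like this. It is shown in Python with NumPy to keep it short, and the loop translates one-to-one to C++. It assumes samples are floats in [-1.0, 1.0]; the -40 dB threshold, the 800-sample silence length and the name `find_sounds` are placeholders you would replace with values measured from your own mic.

```python
import numpy as np

def find_sounds(samples, threshold_db=-40.0, min_silence=800):
    """Return (start, end) sample index pairs of regions louder than the threshold.
    A region ends only after `min_silence` consecutive quiet samples."""
    threshold = 10 ** (threshold_db / 20.0)        # dB relative to full scale -> amplitude
    loud = np.abs(np.asarray(samples, dtype=float)) > threshold
    regions, start, quiet_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i                          # a new sound begins here
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence:           # long enough gap: close the region
                regions.append((start, i - quiet_run + 1))
                start, quiet_run = None, 0
    if start is not None:                          # stream ended inside a sound
        regions.append((start, len(loud) - quiet_run))
    return regions
```

Requiring a run of quiet samples matters because speech crosses zero constantly, so single quiet samples occur inside every word.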

Can FFT be used to find drum solos/breaks in audio files?

Is it possible with FFT to find a drum solo, or a drum break, in an audio file? Is this something FFT is able to do and are there any resources online that could aid me with learning?
In general, an FFT is not a good choice for detecting the onset of percussion sounds:
An FFT is always calculated over a window of samples (in effect a period of time) and yields the magnitude of signal within the bin and its phase offset. You can therefore determine that there is signal at that particular bin, but not its onset time. The best time resolution available is the window period. Of course, you can make the period shorter at the expense of frequency resolution.
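The trade-off is simple arithmetic: a window of N samples at sample rate fs lasts N / fs seconds and gives bins fs / N Hz wide, so the product of the two is always 1. A small Python illustration (the 44100 Hz rate and the window sizes are just example values):

```python
def fft_resolution(window_size, sample_rate):
    """Return (window length in ms, bin width in Hz) for one FFT window."""
    time_ms = 1000.0 * window_size / sample_rate
    freq_hz = sample_rate / window_size
    return time_ms, freq_hz

for n in (256, 1024, 4096):
    t, f = fft_resolution(n, 44100)
    print(f"{n:5d} samples: {t:6.1f} ms per window, {f:6.1f} Hz per bin")
```

At 44.1 kHz a 256-sample window localises an event to about 5.8 ms but has bins about 172 Hz wide, while a 4096-sample window has bins about 10.8 Hz wide but smears everything over about 93 ms.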
Percussion sounds tend to look like noise and spread across the spectrum. This would be OK if you only had percussion sounds, but is not great in real-life polyphonic content.
However, you might be able to find some inference from the different characteristics of the spectra of a drum solo vs instrumental sections of a track.
The problem of finding the time at which percussion sounds start in music is described in academic journals as onset detection and is one of the many techniques used for feature extraction; the wider field is known as Music Information Retrieval. Your problem sounds like one of identifying sections in audio files, and this might be described as partitioning.
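To give a flavour of what an onset detector looks like, here is a minimal sketch of spectral flux, one commonly used onset detection function, in Python with NumPy. It is my choice of example rather than something prescribed above, and the frame and hop sizes, the mean-plus-k-standard-deviations threshold and both function names are assumptions.

```python
import numpy as np

def spectral_flux(signal, frame=1024, hop=512):
    """Onset strength per frame: summed increase in magnitude across all bins."""
    window = np.hanning(frame)
    prev = np.zeros(frame // 2 + 1)
    flux = []
    for start in range(0, len(signal) - frame + 1, hop):
        mag = np.abs(np.fft.rfft(signal[start:start + frame] * window))
        flux.append(np.sum(np.maximum(mag - prev, 0.0)))   # count rises only
        prev = mag
    return np.array(flux)

def pick_onsets(flux, sample_rate, hop=512, k=2.0):
    """Return times (s) of local flux peaks above mean + k * std (hypothetical rule)."""
    threshold = flux.mean() + k * flux.std()
    peaks = [i for i in range(1, len(flux) - 1)
             if flux[i] > threshold
             and flux[i] >= flux[i - 1] and flux[i] > flux[i + 1]]
    return [i * hop / sample_rate for i in peaks]
```

The reported times are only accurate to about one hop, which is the time-resolution limit mentioned above. Telling a drum solo from a full-band section would need further features on top of the onset list.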
A good place to start is Sonic Visualiser which is a tool written specifically for MIR applications. Plug-ins exist for various types of feature extraction. From these you will be able to easily find the large body of academic work in this area. There is an added bonus that the existing plug-ins are all open source too.
I'd look here, there was a bit of discussion with great pointers on the Gamedev SE: https://gamedev.stackexchange.com/questions/9761/beat-detection-and-fft :-)
