How i recognize a unique sound in a noisy environment? - audio

I am developing app to detect the inability of elderly people to unlock their rooms using IC cards in their daycare center.
This room doors has an electronic circuit that emits beep sounds d to signal the user failure in unlock the room. My goal is to detect this beep signal.
I have searched a lot and found some possibilities:
To clip the beep sound and use as a template signal and compare it with test signal (the complete human door interaction audio clip) using convolution, matched filters, DTW or what so ever to measure their similarity. What do u recommend and how to implement it.
To analyze the FFT of beep sound to see if it has a frequency band different that of the background noise. I do not understand how to do it exactly?
To check whether the beep sound form a peak at certain frequency spectrum that is absent in the background noise. If so, Implement a freclipped the beep sound and got the spectrogram as shown in the figure spectrogram of beep sound. but i cannot interpret it? could u give me a detailed explanation of the spectrogram.
3.What is your recommendation? If you have other efficient method for beep detection, please explain.

There is no need to calculate the full spectrum. If you know the frequency of the beep, you can just do a single point DFT and continuously check the level at that frequency. If you detect a rising and falling edge within a given interval it must be the beep sound.
You might want to have a look at the Goertzel Algorithm. It is an algorithm for continuous single point DFT calculation.

Related

Algorithm to deal with Audio click/pop sounds

I am making a sound engine where I can play and stop sound. My issue is if a user wants to stop the sound I immediately stop it ie I send 0 as PCM value. This has the consequence of producing a pop / click sound because the PCM value drops from lets say 0.7 to 0 immediately causing a pop/click sound which is very annoying to hear.
Here is a discussion about this.
I am looking for an algorithm or a way to deal with these audio clicks / pops. What is the best practice for dealing audio clicks? Is there a universal way to go about this? I am very new to audio DSP and I could not find a good answer for this.
When you cut off the sound abruptly, you are multiplying it by a step-shaped signal.
When you multiply two signals together, you convolve their frequencies. A step-shape has energy at all frequencies, so the multiplication will spread the energy from the sound over all frequencies, making an audible pop.
Instead, you want to fade the sound out over 30ms or so -- that is still very fast, and will sound like an abrupt stop, but there will be no audible pop.
You should use a curve shaped like 1-t2 to modulate the volume, or something else without significant high-frequency components. That way, when it is convolved with the original sound in the frequency domain, it won't produce any new frequencies.

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.
Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

If I know the SoundFont that a MIDI to audio track has used, can I theoretically reverse the audio back into it's (most likely) MIDI components? If so, what would be one of the best approach to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.
For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your record sample rates are exactly matched (synchronized to 1ppm or better, have no additional effects, also both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross correlations of your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a-priori known font to create a time line of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak pick to find the most likely MIDI note onset times.
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, and etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as this works reliably only for close to pure sinewaves, which very few musical instruments or interesting fonts sound like (or are as boring as).

Using microphone input to create a music visualization in real time on a 3D globe

I am involved in a side project that has a loop of LEDs around 1.5m in diameter with a rotor on the bottom which spins the loop. A raspberry pi controls the LEDs so that they create what appears to be a 3D globe of light. I am interested in a project that takes a microphone input and turns it into a column of pixels which is rendered on the loop in real time. The goal of this is to see if we can have it react to music in real-time. So far I've come up with this idea:
Using a FFT to quickly turn the input sound into a function that maps to certain pixels to certain colors based on the amplitude of the resultant function at frequencies, so the equator of the globe would respond to the strength of the lower-frequency sound, progressing upwards towards the poles which would respond to high frequency sound.
I can think of a few potential problems, including:
Performance on a raspberry pi. If the response lags too far behind the music it wouldn't seem to the observer to be responding to the specific song he/she is also hearing.
Without detecting the beat or some overall characteristic of the music that people understand it might be difficult for the observers to understand the output is correlated to the music.
The rotor has different speeds, so the image is only stationary if the rate of spin is matched perfectly to the refresh rate of the LEDs. This is a problem, but also possibly helpful because I might be able to turn down both the refresh rate and the rotor speed to reduce the computational load on the raspberry pi.
With that backstory, I should probably now ask a question. In general, how would you go about doing this? I have some experience with parallel computing and numerical methods but I am totally ignorant of music and tone and what-not. Part of my problem is I know the raspberry pi is the newest model, but I am not sure what its parallel capabilities are. I need to find a few linux friendly tools or libraries that can do an FFT on an ARM processor, and be able to do the post-processing in real time. I think a delay of ~0.25s or about would be acceptable. I feel like I'm in over my head so I thought id ask you guys for input.
Thanks!

Theory behind Autotune/vocoder

I've been hunting all over the web for material about vocoder or autotune, but haven't got any satisfactory answers. Could someone in a simple way please explain how do you autotune a given sound file using a carrier sound file?
(I'm familiar with ffts, windowing, overlap etc., I just don't get the what do we do when we have the ffts of the carrier and the original sound file which has to be modulated)
EDIT: After looking around a bit more, I finally got to know exactly what I was looking for -- a channel vocoder. The way it works is, it takes two inputs, one a voice signal and the other a musical signal rich in frequency. The musical signal is modulated by the envelope of the voice signal, and the output signal sounds like the voice singing in the musical tone.
Thanks for your help!
Using a phase vocoder to adjust pitch is basically pitch estimation plus interpolation in the frequency domain.
A phase vocoder reconstruction method might resample the frequency spectrum at, potentially, a new FFT bin spacing to shift all the frequencies up or down by some ratio. The phase vocoder algorithm additionally uses information shared between adjacent FFT frames to make sure this interpolation result can create continuous waveforms across frame boundaries. e.g. it adjusts the phases of the interpolation results to make sure that successive sinewave reconstructions are continuous rather than having breaks or discontinuities or phase cancellations between frames.
How much to shift the spectrum up or down is determined by pitch estimation, and calculating the ratio between the estimated pitch of the source and that of the target pitch. Again, phase vocoders use information about any phase differences between FFT frames to help better estimate pitch. This is possible by using more a bit more global information than is available from a single local FFT frame.
Of course, this frequency and phase changing can smear out transient detail and cause various other distortions, so actual phase vocoder products may additionally do all kinds of custom (often proprietary) special case tricks to try and fix some of these problems.
The first step is pitch detection. There are a number of pitch detection algorithms, introduced briefly in wikipedia: http://en.wikipedia.org/wiki/Pitch_detection_algorithm
Pitch detection can be implemented in either frequency domain or time domain. Various techniques in both domains exist with various properties (latency, quality, etc.) In the F domain, it is important to realize that a naive approach is very limiting because of the time/frequency trade-off. You can get around this limitation, but it takes work.
Once you've identified the pitch, you compare it with a desired pitch and determine how much you need to actually pitch shift.
Last step is pitch shifting, which, like pitch detection, can be done in the T or F domain. The "phase vocoder" method other folks mentioned is the F domain method. T domain methods include (in increasing order of quality) OLA, SOLA and PSOLA, some of which you can read about here: http://www.scribd.com/doc/67053489/60/Synchronous-Overlap-and-Add-SOLA
Basically you do an FFT, then in the frequency domain you move the signals to the nearest perfect semitone pitch.

Resources