I am developing software that automatically records my voice and extracts every word. I am using the PortAudio library for this. But I am stuck on detecting the sound: I treat a sample value of zero as silence, so any zero sample should mark the start or end of a sound. But when I ran it, the program produced far too many words. I think it is because the values I read from PortAudio are raw data, so they can't be processed like that. Am I right? How can I fix it? By the way, I am coding in C++ :D
To detect the presence of a signal in a PCM stream you have to be able to distinguish it from the noise. As dprogramz said, the noise floor of your sound card is probably not perfect, so some noise will be recorded (even with no mic connected).
The solution is to use a VOX (voice-operated switch) or VAD (voice activity detection) algorithm to detect the presence of your voice. VOX can be tricky, since in most consumer-grade electronics the noise floor is just low enough to be "silence" to the human ear, relative to the signal. This means that the difference in amplitude between the noise floor and the signal may be slight. If your sound card has AGC (automatic gain control) turned on this can make it even more difficult, since the noise floor may move. Having said that, VOX can be implemented successfully on consumer-grade equipment. It just takes more effort to establish the threshold. Ideally the threshold is recalculated periodically while the stream is active.
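A minimal sketch of a threshold that is recalculated while the stream is active, written in Python for brevity (the same logic carries over to C++ with PortAudio buffers). The class name `AdaptiveVox` and the values of `margin_db`, `alpha`, and `initial_floor_db` are assumptions for illustration, not tuned figures.

```python
import math

def rms_db(frame):
    """RMS level of a frame of float samples in [-1, 1], in dB relative to full scale."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)  # small epsilon avoids log(0)

class AdaptiveVox:
    """Hypothetical VOX: a frame is 'active' if it is margin_db above a tracked noise floor."""

    def __init__(self, margin_db=10.0, alpha=0.05, initial_floor_db=-60.0):
        self.margin_db = margin_db      # how far above the floor counts as signal
        self.alpha = alpha              # smoothing factor for the floor estimate
        self.floor_db = initial_floor_db

    def is_active(self, frame):
        level = rms_db(frame)
        active = level > self.floor_db + self.margin_db
        if not active:
            # Update the floor only from frames judged to be noise,
            # so speech does not drag the estimate upwards.
            self.floor_db += self.alpha * (level - self.floor_db)
        return active
```

One known weakness of this sketch: if the noise floor jumps by more than `margin_db` in one step (for example when AGC kicks in), every frame looks active and the floor never updates, so a real implementation would need some kind of timeout or reset.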
If I were doing this I'd implement a VAD algorithm. Since your objective is to detect your voice this should provide a reliable result regardless of the equipment you use.
I don't think it's because it is a RAW value. Raw PCM audio is simply a sequence of amplitude samples.
However, the value will rarely (if ever) be zero. You have to take into account that there is a small amount of electrical noise made by the mic. Figure out the "idle" dB level of your mic (just test the level when you aren't talking into it). You then need to set a silence threshold (below a certain dB level for a certain number of samples) to detect the beginning/end. Attempting to detect a zero value is gonna be near impossible.
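The silence-threshold idea could look roughly like this. It is a sketch in Python rather than C++, and `find_segments`, `frame_len`, `threshold_db`, and `min_silence_frames` are made-up names with placeholder values; the threshold in particular has to come from measuring your own mic's idle level.

```python
import math

def frame_db(frame):
    """RMS level of one frame of float samples in [-1, 1], in dB relative to full scale."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq + 1e-12)  # epsilon avoids log(0)

def find_segments(samples, frame_len=160, threshold_db=-40.0, min_silence_frames=3):
    """Return (start_sample, end_sample) pairs for stretches louder than the threshold.

    A segment only ends after min_silence_frames consecutive quiet frames,
    so short dips inside a word do not split it.
    """
    segments = []
    start = None
    quiet_run = 0
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if frame_db(frame) > threshold_db:
            if start is None:
                start = i * frame_len
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence_frames:
                # The segment ends where the run of quiet frames began.
                segments.append((start, (i + 1 - quiet_run) * frame_len))
                start = None
                quiet_run = 0
    if start is not None:
        # Stream ended while still inside a segment.
        segments.append((start, (n_frames - quiet_run) * frame_len))
    return segments
```

Working on frames instead of single samples is what stops the "many words" problem: a speech waveform crosses zero constantly, but its short-term level stays well above the noise.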
I work in the field of phonetics and often need to record human speech for acoustic analysis. I have two questions that I couldn't find answers to:
If I record in stereo, I need to convert to mono later on to proceed with annotation. So in principle a mono signal is good enough. Are there reasons to record in stereo (e.g. would the signal be better)?
Also, we were warned that the gain level should be kept low so that the recording level doesn't exceed the maximum, which leads to signal cutoff (clipping). However, I was also criticised when a recording showed too low an amplitude (it was still very clear though), because that leads to a low SNR. How do people choose an appropriate gain level?
As the act of recording is involved, the Sound Design forum might be your best bet.
I can't think of anything that might be gained, in terms of frequency analysis, by having a stereo signal. Stereo is more about locating the source of a sound in 3D space. Does the source of the sound emit different frequency profiles in different directions? Does the environment filter the sound differently over the course of the two paths to the stereo inputs? If the answer is "not significantly" then mono should be fine.
Choosing an appropriate gain level is mostly a matter of knowing your equipment. Ideally, your recording setup will provide feedback (usually a visual meter of some sort) that shows the signal strength. The "best" would be (theoretically) the loudest level that does not distort. So you have to know at what level distortion happens on all the elements of the recording chain.
There can be some fudging on this, given that the loudest peak on a recorded segment may be an outlier.
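If there is no meter, the same check can be done on a test recording after the fact. A rough sketch, assuming float samples scaled to [-1, 1]; the function names and the -6 dBFS target are my own placeholders, not a standard.

```python
import math

def to_dbfs(value):
    """Convert a linear amplitude (1.0 = full scale) to dBFS."""
    return 20.0 * math.log10(value) if value > 0 else float("-inf")

def peak_dbfs(samples):
    """Absolute peak of the recording in dBFS; 0 dBFS means it touched full scale."""
    return to_dbfs(max(abs(s) for s in samples))

def percentile_peak_dbfs(samples, pct=99.9):
    """Level that pct percent of samples stay below, to ignore outlier peaks."""
    mags = sorted(abs(s) for s in samples)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return to_dbfs(mags[idx])

def suggested_gain_change_db(samples, target_dbfs=-6.0):
    """How many dB to raise (positive) or lower (negative) the gain next time."""
    return target_dbfs - peak_dbfs(samples)
```

Comparing `peak_dbfs` with `percentile_peak_dbfs` shows how much of the headroom is being spent on a few outlier peaks, which is where the fudging comes in.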
What I want is to be able to send a signal to my Raspberry Pi at home when I'm not at home, so I can e.g. wake up my PC. I always have an old phone lying around that I never really use. So I thought: I call my phone, a specific mp3 ringtone plays, my Raspberry Pi listens, recognizes the ringtone and treats that as the signal. So I can pretty much choose whatever ringtone I want (but hopefully not too long a one). The problem is that it has to be recognizable by the Raspberry Pi and distinguishable from other sounds. At best I can play random music at home and it will not get triggered until it's the specific ringtone I chose.
So I'm at the very beginning of the project and I have a lot of questions. Is this even feasible? How do I listen to the ringtone? Should I use a normal microphone, or could I e.g. trigger some GPIO pin as long as a specific frequency is played? What kind of ringtone should I use to be as distinguishable as possible? And how do I write the software to recognize the sound?
I know this is a lot and I don't expect a step by step solution. But maybe you got some hints to get me in the right direction?
If someone has a similar problem, I found a solution: First I had to choose between a mostly hardware solution and a mostly software solution. The hardware solution is to filter specific frequencies. This seems to be pretty hard using normal band-pass filters if you want narrow bands. There are also components that can do that; I now know of the NE567. But this component only reacts to one frequency and draws quite a lot of power. To recognize a ringtone, several of these components are needed, which means more power consumption. Additionally, this solution is pretty inflexible.
So I went for the software solution. I now have an Arduino Uno that gets an amplified electret microphone signal on an analog input pin. The data is collected and simultaneously analysed with an FFT algorithm. Then I check for the dominant frequency, if there is one, and save it in an array. Every time I get a new data point I compare the array with the pattern of my ringtone and calculate a score for the match. If the score is high enough the ringtone is "found" and I can trigger my event.
I'm actually pretty pleased with the solution because it works quite well even with the phone some feet away from the microphone. I thought I would need to put the microphone almost directly next to the phone to get good results, but I don't have to. It's still a little sensitive, because the sound volume shouldn't be too high or too low. But with the right volume settings it works over quite a big area when the phone is in the same room. It works even better with some space between microphone and phone, because the phone's radiation during the call seems to disturb the circuit quite a lot. There is also the problem that other noises block the ringtone recognition. I could compensate for that in my algorithm, but I have almost used up all the resources of the Arduino, so I had to keep the algorithm simple. In my case I don't have a noisy environment, so this is not a problem for me. Another pro is that my event was never triggered by another sound, and it seems almost impossible that this could happen by accident.
So it is feasible and I think it's actually quite an elegant solution. I also thought about vibration detection or even directly using the vibration motor's signal, but I have no control over the vibration function of that old phone. But I can choose the ringtone for every contact, so I only gave the "magic" ringtone to myself and so the event can only be triggered by me. I only have to say that writing the software was kind of hard with the Arduino's limitations. Because I need the data in real time I have limited time for the calculation. I had to limit the incoming data and therefore I can only listen to frequencies up to 10 kHz. But the ringtone recognition is still possible and I think it was worth the effort. :)
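For anyone who wants to try the idea on a PC first, here is a rough Python sketch of the two steps (dominant frequency per frame, then a match score against a stored pattern). It is not the Arduino code: it uses a naive DFT instead of an FFT, and `dominant_frequency`, `match_score`, `min_magnitude`, and `tolerance_hz` are names and values made up for illustration.

```python
import cmath
import math

def dominant_frequency(frame, sample_rate, min_magnitude=1.0):
    """Return the strongest frequency in Hz, or None if no bin is strong enough."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip the DC bin
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    if best_mag < min_magnitude:
        return None  # nothing stands out, treat the frame as silence
    return best_bin * sample_rate / n

def match_score(history, pattern, tolerance_hz=50.0):
    """Fraction of pattern entries matched by the most recent history entries."""
    if len(history) < len(pattern):
        return 0.0
    recent = history[-len(pattern):]
    hits = sum(1 for got, want in zip(recent, pattern)
               if got is not None and abs(got - want) <= tolerance_hz)
    return hits / len(pattern)
```

Usage would be to append the result of `dominant_frequency` to a list for every new frame and fire the event once `match_score` exceeds some chosen cutoff, e.g. 0.8 (again a guess, not a measured value).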
I've combed StackOverflow and the web for many questions on whistle detection, etc, and many people did explain as much as they could as to how they can go about detecting their stuff.
capturing sound for analysis and visualizing frequences in android
analyzing whistle sound for pitch note
But what I don't get is how does FFT help you to detect certain sounds in a given sample audio data?
Here's what I understand so far from some stuff I found here and there.
-The sine wave is more or less the building block of ALL signals, musical or not
-Three parameters - FREQUENCY, AMPLITUDE, and INITIAL PHASE, characterize every steady sine wave completely.
-They make each and any kind of wave unique.
-Fourier transform can be used to inspect what kinds of sine waves there are in a signal
SOURCE -- [Audio signal processing basics][3]
Audio data that the computer generates as received from the mic or other input source, for live processing, is an array of amplitudes processed (or stored or taken) at a particular sample rate.
So how does one go from that to detecting whistles and claps?
And complex things such as say, a short period of whistling to a particular song?
My theory of detection is that we look at our whistles in a spectrogram and record their particular frequency and amplitude characteristics. Then, if those particular characteristics show up again in the input, we've detected a whistle.
Am I right or wrong?
This sound processing stuff is a little complicated.
Forgot to mention this - I'm using Python. Java is also okay, since most of the example code I found was for Android, which is in Java. And I can work in Java too. Any mention of any libraries or APIs would be helpful too.
I'm trying to compare sound clips based on microphone recording. Simply put, I play an MP3 file while recording from the speakers, then attempt to match the two files. I have algorithms in place that work, but I'm seeing a slight difference I'd like to sort out to get better accuracy.
The microphone seems to favor some frequencies (adds amplitude), and to be slightly off on others (peaks are wider on the mic).
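To make the theory above concrete, here is one possible sketch in plain Python. It assumes a whistle is a nearly pure tone somewhere between 500 Hz and 5 kHz, whereas a clap spreads its energy over many frequencies; that band, the `min_purity` cutoff, and the function names are my assumptions, not measured characteristics. In real code `numpy.fft.rfft` would replace the slow DFT loop.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. n/2 - 1."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2)]

def looks_like_whistle(frame, sample_rate, band=(500.0, 5000.0), min_purity=0.7):
    """True if most of the energy sits in one narrow peak inside the given band."""
    spec = power_spectrum(frame)
    total = sum(spec[1:])  # ignore the DC bin
    if total < 1e-9:
        return False  # silence
    peak = max(range(1, len(spec)), key=lambda k: spec[k])
    # Energy in the peak bin and its immediate neighbours.
    near = sum(spec[max(1, peak - 1):peak + 2])
    freq = peak * sample_rate / len(frame)
    return band[0] <= freq <= band[1] and near / total >= min_purity
```

Requiring the test to pass for several consecutive frames, and tracking how the peak frequency moves from frame to frame, would be the next step towards recognising a whistled tune rather than a single whistle.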
I'm wondering what the cause of this difference is, and how to compensate for it.
Background:
Because of speed issues in how I'm doing comparison I select certain frequencies with certain characteristics. The problem is that a high percentage of these (depending on how many I choose) don't match between MP3 and mic.
It's called the response characteristic of the microphone. Unfortunately, you can't easily get around it without buying a different, presumably more expensive, microphone.
If you can measure the actual frequency response of the microphone by some method (which generally requires a calibrated reference acoustic system and an anechoic chamber), you can compensate for it by applying an equaliser tuned to exactly the inverse characteristic, as discussed here. But in practice, as Kilian says, it's much simpler to get a more precise microphone. I'd recommend a condenser or an electrostatic one.
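As a sketch of what "inverse characteristic" means numerically: if the response is known per frequency band in dB, the correction is simply its negative, usually with the boost capped so that noise in weak bands is not amplified without limit. The function names, the per-band representation, and the 12 dB cap are assumptions for illustration.

```python
def inverse_eq_gains(measured_response_db, max_boost_db=12.0):
    """Per-band correction in dB: the negated measured response, with boost capped."""
    return [min(-r, max_boost_db) for r in measured_response_db]

def apply_eq(band_magnitudes, gains_db):
    """Scale linear band magnitudes by the given gains in dB."""
    return [m * 10 ** (g / 20.0) for m, g in zip(band_magnitudes, gains_db)]
```

For example, a mic that is 3 dB hot in one band and 6 dB weak in another gets gains of -3 dB and +6 dB there, after which the band magnitudes of the recording should line up much better with those of the original MP3.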
After watching the first game of the FIFA World Cup I was very annoyed by the sound of the vuvuzelas. A theoretical question came up about filtering that noise out of the sound stream.
What algorithms are needed to remove such a "constant" noise and is it possible to keep the quality of other background sounds?
Well, you could do it very easily if you had a live secondary microphone picking up only (or mostly) the vuvuzelas, which is how noise-canceling headphones work. Or you could identify the frequency signature of the vuvuzelas from a sample, loop it, and use it to counter them with destructive interference. It would not be as effective as the live version, of course.
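A toy version of the second idea, for a single steady tone: estimate its amplitude and phase over a block, then subtract the estimate, so the drone cancels while everything else stays. This is only a sketch; `cancel_tone` is a made-up name, the drone frequency has to be measured from a real sample, and the estimate is only exact when the block spans a whole number of cycles of the tone.

```python
import math

def cancel_tone(samples, sample_rate, freq_hz):
    """Estimate one sinusoid at freq_hz by projection and subtract it from the block."""
    n = len(samples)
    w = 2.0 * math.pi * freq_hz / sample_rate
    # Project the block onto cosine and sine at the drone frequency.
    a = 2.0 / n * sum(s * math.cos(w * t) for t, s in enumerate(samples))
    b = 2.0 / n * sum(s * math.sin(w * t) for t, s in enumerate(samples))
    # Subtracting the estimate is the same as adding its inverted copy.
    return [s - (a * math.cos(w * t) + b * math.sin(w * t))
            for t, s in enumerate(samples)]
```

A real vuvuzela drone also has harmonics and drifts over time, so this would have to be repeated per harmonic and per short block, and anything else in the broadcast at those frequencies gets removed along with it.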