What is the ideal audio level for Sphinx?

On my system, using my USB microphone, I've found that the audio level that works best with CMU Sphinx is about 20% of the maximum. This gives me 75% voice recognition accuracy. If I amplify this digitally I get far worse recognition accuracy (25%). Why is this? What is the recommended audio level for Sphinx? [Also I am using 16,000 samples/sec, 16-bit.]

The pocketsphinx decoder uses channel amplitude normalization (cepstral mean normalization). The initial normalization value is indeed configured for about a 20% audio level inside the model (the -cmninit parameter in feat.params). However, the level is updated as you decode, so it only affects the first utterance. If you decode properly in continuous mode, the level should not matter. Do not restart the recognizer for every utterance; let it adapt to the noise and audio level.
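A minimal continuous-decoding sketch with the pocketsphinx Python bindings, illustrating the point about reusing one decoder; the model paths and the utterances iterable are assumptions for illustration:

import os
from pocketsphinx import Decoder, get_model_path

model = get_model_path()
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model, 'en-us'))               # acoustic model
config.set_string('-lm', os.path.join(model, 'en-us.lm.bin'))         # language model
config.set_string('-dict', os.path.join(model, 'cmudict-en-us.dict'))
decoder = Decoder(config)   # create ONCE; its channel normalization adapts over time

for chunk in utterances:    # hypothetical iterable of 16 kHz, 16-bit mono audio
    decoder.start_utt()
    decoder.process_raw(chunk, False, False)
    decoder.end_utt()       # the adapted CMN level carries over to the next utterance
    hyp = decoder.hyp()
    print(hyp.hypstr if hyp else '(no hypothesis)')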

Related

does converting from mulaw to linear impact audio quality?

I want to change audio encoding from mulaw to linear in order to use a linear speech recognition model from Google.
I'm using a telephony channel, so audio is encoded in mulaw, 8bits, 8000Hz.
When I use the Google mulaw model, there is an issue recognizing some short single words: they are not recognized at all, and the API returns None.
I was wondering whether it is good practice to change the encoding to LINEAR16 or FLAC?
I already did it, but I cannot really measure the degree of improvement.
It is always best practice to use either LINEAR16 for headerless audio data or FLAC for headered audio data; both are lossless codecs. It is good practice to set the sampling rate to 16000 Hz; otherwise, set sample_rate_hertz to match the native sample rate of the audio source (instead of re-sampling). Since the Google Speech-to-Text API provides various ways to improve audio quality, you can use word-level confidence to measure the accuracy of the response.
Ideally the audio would be recorded from the start using a lossless codec like LINEAR16 or FLAC. But once you have it in a format like mulaw, transcoding it before sending it to Google Speech-to-Text is not helpful.
Consider using model=phone_call and use_enhanced=true for better accuracy on telephony audio.
For quick experimentation you can use STT UI https://cloud.google.com/speech-to-text/docs/ui-overview.
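A sketch of a matching request with the google-cloud-speech Python client, keeping the native mulaw/8000 Hz format rather than transcoding; the file name is a placeholder:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.MULAW,  # native telephony format
    sample_rate_hertz=8000,
    language_code="en-US",
    model="phone_call",          # telephony-tuned model
    use_enhanced=True,           # enhanced phone_call model
    enable_word_confidence=True, # word-level confidence in the response
)
with open("call.ulaw", "rb") as f:   # placeholder file name
    audio = speech.RecognitionAudio(content=f.read())
response = client.recognize(config=config, audio=audio)
for result in response.results:
    best = result.alternatives[0]
    print(best.transcript, [(w.word, w.confidence) for w in best.words])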

Detecting Human Voice within Gamemaker Audio data

I am writing a game in which I need to process data coming from the microphone to determine if it contains human voice. The data is 16-bit PCM. Is there existing code out there that does this? Or at least some pseudocode close enough that it could be implemented to do this?
I'd consider going with RNNoise, a noise suppression model. What is of interest to you is the VAD: the Voice Activity Detector. For your purpose, you could discard the noise suppression part and use the VAD alone. Values close to 1 indicate voice, close to 0 otherwise. It's a robust system with very good performance; a minimal calling sketch follows below.
You might also consider modifying the neural network to output only the VAD and simplifying the model accordingly, assuming you want to keep the computational footprint small.
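The promised sketch: calling the RNNoise VAD from Python via ctypes. It assumes librnnoise is built and on your library path; the library name and the resampling step are assumptions, and depending on the RNNoise version rnnoise_create takes a model pointer (pass None) or no argument at all.

import ctypes
import numpy as np

lib = ctypes.CDLL("librnnoise.so")   # adjust name/path for your platform
lib.rnnoise_create.restype = ctypes.c_void_p
lib.rnnoise_create.argtypes = [ctypes.c_void_p]
lib.rnnoise_process_frame.restype = ctypes.c_float   # returns the VAD probability
lib.rnnoise_process_frame.argtypes = [
    ctypes.c_void_p,
    ctypes.POINTER(ctypes.c_float),
    ctypes.POINTER(ctypes.c_float),
]

FRAME = 480                       # RNNoise works on 10 ms frames at 48 kHz
state = lib.rnnoise_create(None)

def vad_probability(frame_i16):
    """frame_i16: 480 samples of 16-bit PCM, already resampled to 48 kHz."""
    x = frame_i16.astype(np.float32)          # RNNoise expects 16-bit-range floats
    out = np.empty(FRAME, dtype=np.float32)   # denoised output, unused here
    return lib.rnnoise_process_frame(
        state,
        out.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
        x.ctypes.data_as(ctypes.POINTER(ctypes.c_float)))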

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search through) hundreds of WAV files and detect, but not remove, static noise. As it is done now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and detect characteristic signatures of the static noise, such as bursts of white noise, full-band high-amplitude noise (like AM radio noise over a phone conversation, a wall of white noise), or bursts of peak high-frequency, high-amplitude noise (as in crackling on the phone line) against a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling, but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files from a phone PBX). These are SIP and analog conversations, depending on the leg of the call, so RTP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and inspecting the spectrogram visually, which is only a little faster than listening to each call and still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch WAV-file search that can flag the recordings that contain the characteristic noise, static, or crackling over the recorded phone conversation.
Unless you can tell the program what the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge, which prompted me to develop free and open-source software to help the user with audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2D scatter plot rather than only "linearly", as in a waveform or spectrogram. When you upload audio, the following happens:
Onsets are detected (based on the high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all of them.
For each audio fragment, audio features are calculated based on your selection. There is no universal best set of features; it all depends on the application. For a start you might try e.g. pitch statistics. Consider setting proper values for the bandpass filter and the sample length (the length of the audio fragment that will be used). Sample length could in the future be established dynamically. Check the docs for more info.
The result is that each fragment is described by a number of features, e.g. 6 or 60. We then have a k-dimensional structure (where k is the number of features), which is projected to 2D space with a dimensionality-reduction algorithm of your selection. Uniform Manifold Approximation and Projection (UMAP) is a sound choice; see the sketch below.
In theory, the resulting embedding should be such that similar sounds (according to the features we selected) sit close together, while different ones sit further apart. Your noise should now be separated from your "not noise" and form a cluster.
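For illustration, projecting a per-fragment feature matrix to 2D with the umap-learn package works roughly like this (the feature matrix here is a random stand-in):

import numpy as np
import umap

features = np.random.rand(500, 20)   # stand-in: 500 fragments x 20 features
embedding = umap.UMAP(n_components=2).fit_transform(features)
# embedding has shape (500, 2); similar fragments should land close together,
# so noise fragments should form their own cluster in the scatter plot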
When you hover over the graph, a set of icons appears in the upper-right corner. One is lasso selection. Use it to mark points, inspect the spectrogram and e.g. download a table with the features that describe that signal. At that point you can also reduce the noise (an extra button appears) in a way similar to Audacity: it analyses the spectrum and attenuates those frequencies with some smoothing.
It does not completely solve your problem right now, but it could severely cut the effort. Going through hundreds of WAVs could take the better part of a day, but you will be done. Want it automated? There's a CLI (command-line interface) that I am developing at the same time. In the not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
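In the meantime, a crude batch heuristic may already flag the worst files. This is my own sketch (not part of Audio Explorer) and the thresholds are illustrative guesses; the idea is that frames that are both loud and spectrally flat tend to be white-noise-like bursts:

import glob
import librosa
import numpy as np

for path in glob.glob("recordings/*.wav"):        # placeholder directory
    y, sr = librosa.load(path, sr=None, mono=True)
    flatness = librosa.feature.spectral_flatness(y=y)[0]   # 0..1 per frame
    rms = librosa.feature.rms(y=y)[0]                      # loudness per frame
    # loud frames that are also spectrally flat look like broadband static
    suspect = (flatness > 0.5) & (rms > np.percentile(rms, 90))
    if suspect.mean() > 0.01:                     # more than ~1% of frames affected
        print(f"FLAG {path}: {suspect.sum()} suspicious frames")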
Suggestions / feedback? Drop an issue on GitHub.

Is it possible to, as accurately as possible, decompose an audio into MIDI, given the SoundFont that was used?

If I know the SoundFont that a MIDI-to-audio track used, can I theoretically reverse the audio back into its (most likely) MIDI components? If so, what would be one of the best approaches to doing this?
The end goal is to try encoding audio (even voice samples) into MIDI such that I can reproduce the original audio in MIDI format better than, say, BearFileConverter. Hopefully with better results than just bandpass filters or FFT.
And no, this is not for any lossy audio compression or sheet transcription, this is mostly for my curiosity.
For monophonic music only, with no background sound, and if your SoundFont synthesis engine and your recording's sample rates are exactly matched (synchronized to 1 ppm or better, with no additional effects, and both using a known A440 reference frequency, known intonation, etc.), then you can try using a set of cross-correlations of your recorded audio against a set of synthesized waveform samples at each MIDI pitch from your a-priori known font, to create a timeline of statistical likelihoods for each MIDI note. Find the local maxima across your pitch range, threshold, and peak-pick to find the most likely MIDI note onset times.
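A sketch of the cross-correlation step with NumPy/SciPy, assuming you have pre-rendered one template waveform per MIDI pitch from the known SoundFont (the templates dict and the threshold names are assumptions):

import numpy as np
from scipy.signal import correlate, find_peaks

def note_likelihood_tracks(recording, templates):
    """templates: {midi_pitch: mono waveform synthesized from the SoundFont}."""
    tracks = {}
    for pitch, tmpl in templates.items():
        tmpl = tmpl / (np.linalg.norm(tmpl) + 1e-12)   # normalize template energy
        tracks[pitch] = correlate(recording, tmpl, mode="valid")
    return tracks

# Onset picking: threshold each track and keep local maxima, e.g.
# find_peaks(tracks[pitch], height=threshold, distance=min_note_gap)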
Another possibility is sliding sound fingerprinting, but at an even higher computational cost.
This fails in real life due to imperfectly matched sample rates plus added noise, speaker and room acoustic effects, multi-path reverb, etc. You might also get false positives for note waveforms that are very similar to their own overtones. Voice samples vary even more from any template.
Forget bandpass filters or looking for FFT magnitude peaks, as these work reliably only for nearly pure sine waves, which very few musical instruments or interesting fonts sound like (or are as boring as).

Feeding real-time audio data to tensorflow on a mobile device

I am building a prototype of a sound detection app that will ultimately run on a phone (iPhone/Android). It needs to be near real-time to respond fast enough when a particular sound is recognized. I am hoping to use TensorFlow to build and train the model and then deploy it on the mobile device.
What I am unsure about is the best way to feed data to TensorFlow for inference in this case.
Option 1: Feed only newly acquired samples to the model.
Here the model itself keeps a buffer of previous signal samples, to which new samples are appended and the whole thing gets processed.
Something like:
samples = tf.placeholder(tf.int16, shape=(None,))   # newly captured samples
buffer = tf.Variable([], trainable=False, validate_shape=False, dtype=tf.int16)
# append the new samples to the persistent buffer (tf.concat takes values first, axis second)
update_buffer = tf.assign(buffer, tf.concat([buffer, samples], 0), validate_shape=False)
detection_op = ....process buffer...
session.run([update_buffer, detection_op], feed_dict={samples: [.....]})
This seems to work, but if samples are pushed to the model 100 times a second, what happens inside tf.assign? The buffer can grow quite large, and if tf.assign reallocates memory on every call, this may not perform well.
Option 2: Feed the whole recording to the model
Here the iPhone app keeps the state/recording samples, and feeds the whole recording to the model. The input can get quite large, and re-running the detection op on the whole recording will have to keep recomputing the same values each cycle.
Option 3: Feed a sliding window of data
Here the app keeps the data for the whole recording but feeds only the latest slice to the model, e.g. the last 2 s at a 2000 Hz sampling rate == 4000 samples, fed every 1/100 s (20 new samples each time). The model may also need to keep some running totals for the whole recording.
Advice?
I'd need to know a bit more about your application requirements, but for simplicity's sake I recommend starting with option #3. The usual way to approach this problem for arbitrary sounds is:
Have some trigger to detect the start of a sound or speech utterance. This can just be sustained audio levels, or something more advanced.
Run a spectrogram over a fixed size window, aligned with the start of the noise.
The rest of the network can just be a standard image detection one (usually cut down in size) to classify the sound.
There are a lot of variations and other possible approaches. For example, for speech it's typical to use MFCCs as the feature generator and then run an LSTM to separate out phonemes, but since you mention sound detection I'm guessing you don't need anything that advanced.
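A rough sketch of that pipeline (the threshold, window size and sample rate are assumptions, not recommendations): a sustained-level trigger, then a fixed-size spectrogram to hand to a small classifier.

import numpy as np
import tensorflow as tf

SR = 16000
WINDOW = SR          # classify 1 s of audio after the trigger

def level_trigger(samples, threshold=500, run=160):
    """Return the first index where |amplitude| stays above threshold for `run` samples."""
    hot = np.abs(samples.astype(np.int32)) > threshold
    for i in range(len(hot) - run):
        if hot[i:i + run].all():
            return i
    return None

def spectrogram(window_i16):
    x = tf.convert_to_tensor(window_i16, dtype=tf.float32) / 32768.0
    stft = tf.signal.stft(x, frame_length=400, frame_step=160)   # 25 ms / 10 ms frames
    return tf.abs(stft)   # (frames, bins) magnitude image for a small CNN classifier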
