How to examine the efficiency of the FEC (Forward Error Correction) algorithm implemented in Unet audio? - audio

What is the appropriate procedure to be followed to examine the efficiency or to identify limitations of the FEC algorithm, which is currently implemented in the Unet audio?
I wanted to analyze how efficiently the current FEC algorithm in the Unet audio works, specifically how many bits of errors it can correct and what its limitations are.
For that, I have tried to use BER (Bit Error Rate) as a metric to check the performance of FEC under various situations using the count of errors in the received data from the BER value.
I have used two laptops for this experiment as there is a very minimal probability of getting errors using a single laptop. I was able to observe the errors in the received data (using BER) when phy[].fec was set to zero, and when I turned on the FEC, it resolved all those errors (BER = 0/144). Then, I introduced some ambient noise by playing music while transmitting data so that it would interfere with the actual data and add some noise to it, simulating a realistic underwater scenario. This resulted in a lot of errors in the received data even after applying FEC.
Here are the screenshots of the two Unet audio shells used for data transmission. In this case, a standard test frame is transmitted to compute the BER for the frame (phy[].test is set to true):
The first TxFrameReq is sent without any noise, and the later one is sent in the presence of some noise.
As depicted in the images above, when some noise is added during data transmission, errors occur in the received message, and data is corrupted even though the FEC is set on.
So, I would like to know if the above method is an appropriate and justified one to analyze the efficiency and get the limitations of the FEC in the Unet audio. If otherwise, it would be really helpful if you could suggest a method to do the same.

What you have done is to enable test mode and disable FEC to measure BER without error correction. Then to enable FEC and check that the BER goes to 0. That is a reasonable way to check FEC performance if you can control the your environment well (so that the error distribution is stationary).
An alternative approach is to use the FECDecodeReq to pass in simulated received bits with controlled bit errors, and check if the FEC is able to recover it. To do this, you'll need to capture a frame encoded with a FEC, but decoded without FEC (but with frameLength set up to capture the received code word fully), with a high enough SNR that there are no errors. Then you can set up FEC and pass those bits in using FECDecodeReq, and it should generate a RxFrameReq for you. You can pass in bits with some errors, and it should correct those for you, yielding 0 BER for the RxFrameReq when error correction works.

Related

Detecting Human Voice within Gamemaker Audio data

I am writing a game that I need to filter out data coming from the microphone to determine if it contains human voice. The data is 16 bit PCM. Is there existing code out there that does this? Or at least some pseudocode that is close that could be implemented to do this?
I'd consider going with RNNoise, a noise suppression model. What is of interest to you is the VAD: Voice Activity Detector. For your purpose, you could discard the noise suppression part and use VAD alone. Values close to 1 indicate voice, 0 otherwise. It's a robust system with a very good performance.
You might consider modifying the neural network to only output VAD and simplifying the model accordingly. That's assuming you want

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search thru) hundreds of WAV files and detect but not remove static noise. As done currently now, I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would need a program that can read each new WAV file and be able to detect characteristic signatures of the static noise such as periods of bursts of white noise or full audio band, high amplitude noise (like AM radio noise over phone conversation such as a wall of white noise) or bursts of peek high frequency high amplitude (as in crackling on the phone line) in a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files for a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrometry and is only a little faster than listening to each call but still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch Wav file search that can render wav file recordings that contain the characteristic noise or static or crackling over the recording phone conversation.
Unless you can tell the program how the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge and that prompted me to develop (free and open source) software to help user in audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
Per each audio fragment, calculate audio features based on your selection. There's no universal best set of features, all depends on the application. You might try for starter with e.g. Pitch statistics. Consider setting proper values for bandpass filter and sample length (that's the length of audio fragment we're going to use). Sample length could be in future established dynamically. Check docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we have then k-dimensional (where k is number of features) structure, which we then project to 2d space with dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to features we have selected) are closely together, while different further apart. Your noise should be now separated from your "not noise" and form cluster.
When you hover over the graph, in right-upper corner a set of icons appears. One is lasso selection. Use it to mark points, inspect spectrogram and e.g. download table with features that describe that signal. At that moment you can also reduce the noise (extra button appears) in a similar way to Audacity - it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but could severely cut the effort. Going through hundreds of wavs could take better part of the day, but you will be done. Want it automated? There's CLI (command-line interface) that I am developing at the same time. In not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

Feeding real-time audio data to tensorflow on a mobile device

I am building a prototype of a sound detection app that will ultimately run on a phone (iPhone/Android). It needs to be near real-time to give fast enough response to the user when a particular sound is recognized. I am hoping to use tensorflow to actually build and train the model and then deploy it on mobile device.
What I am unsure about is best way to feed data to tensorflow for inference in this case.
Option 1: Feed only newly acquired samples to the model.
Here the model itself keeps a buffer of previous signal samples, to which new samples are appended and the whole thing get processed.
Something like:
samples = tf.placeholder(tf.int16, shape=(None))
buffer = tf.Variable([], trainable=False, validate_shape=False, dtype=tf.int16)
update_buffer = tf.assign(buffer, tf.concat(0, [buffer, samples]), validate_shape=False)
detection_op = ....process buffer...
session.run([update_buffer, detection_op], feed_dict={samples: [.....]})
This seems to work, but if the samples are pushed to the model 100 times a second, what's happening inside tf.assign (the buffer can grow big enough, and if tf.assign constantly allocates memory this may not work well)?
Option 2: Feed the whole recording to the model
Here the iPhone app keeps the state/recording samples, and feeds the whole recording to the model. The input can get quite large, and re-running the detection op on the whole recording will have to keep recomputing the same values each cycle.
Option 3: Feed a sliding window of data
Here the app keeps the data for the whole recording, but feeds only the latest slice of data to the model. E.g. last 2 sec at 2000 sampling rate == 4000 sample fed fed at the rate of 1/100 sec (each new 20 samples). The model may also need to keep some running totals for the whole recording.
Advise?
I'd need to know a bit more about your application requirements, but for simplicities sake I recommend starting with option #3. The usual way to approach this problem for arbitrary sounds is:
Have some trigger to detect the start of a sound or speech utterance. This can just be sustained audio levels, or something more advanced.
Run a spectrogram over a fixed size window, aligned with the start of the noise.
The rest of the network can just be a standard image detection one (usually cut down in size) to classify the sound.
There are a lot of variations and other possible approaches. For example for speech it's typical to use MFCC as your feature generator, and then run an LSTM to separate out phonemes, but since you mention sound detection I'm guessing you don't need anything this advanced.

Audio signal source separation with neural network

What I am trying to do is separating the audio sources and extract its pitch from the raw signal.
I modeled this process myself, as represented below:
Each sources oscillate in normal modes, often makes its component peaks' frequency integer multiplication. It's known as Harmonic. And then resonanced, finally combined linearly.
As seen in above, I've got many hints in frequency response pattern of audio signals, but almost no idea how to 'separate' it. I've tried countless of my own models. This is one of them:
FFT the PCM
Get peak frequency bins and amplitudes.
Calculate pitch candidate frequency bins.
For each pitch candidates, using recurrent neural network analyze all the peaks and find appropriate combination of peaks.
Separate analyzed pitch candidates.
Unfortunately, I've got non of them successfully separates the signal until now.
I want any of advices to solve these kind of problem.
Especially in modeling of source separation like my one above.
Because no one has really attempted to answer this, and because you've marked it with the neural-network tag, I'm going to address the suitability of a neural network to this kind of problem. As the question was somewhat non-technical, this answer will also be "high level".
Neural networks require some sort of sample set from which to learn. In order to "teach" a neural net to solve this problem you would essentially need to have a working set of known solutions to work from. Do you have this? If so, read on. If not, a neural is probably not what you are seeking. You stated that you have "many hints" but no real solution. This leads me to believe you probably don't have sample sets. If you can get them, great, otherwise you might be out of luck.
Supposing now that you have a sample set of Raw Signal samples and corresponding Source 1 and Source 2 outputs... Well, now you're going to need a method for deciding on a topology. Assuming you don't know a lot about how neural nets work (and don't want to), and assuming you also don't know the exact degree of complexity of the problem, I would probably recommend the open source NEAT package to get you started. I am not affiliated in any way with this project, but I have used it, and it allows you to (relatively) intelligently evolve neural network topologies to fit the problem.
Now, in terms of how a neural net would solve this specific problem. The first thing that comes to mind is that all audio signals are essentially time-series. That is to say, the information they convey is somehow dependent and related to the data at previous timesteps (e.g. the detection of some waveform cannot be done from a single time-point; it requires information about previous timesteps as well). Again, there's a million ways of solving this problem, but since I'm already recommending NEAT I'd probably suggest you take a look at the C++ NEAT Time Series mod.
If you're going down this route, you'll probably be wanting to use some sort of sliding window to provide information about the recent past at each time step. For a quick and dirty intro to sliding windows, check out this SO question:
Time Series Prediction via Neural Networks
The size of the sliding window can be important, especially if you're not using recurrent neural nets. Recurrent networks allow neural nets to remember previous time steps (at the cost of performance - NEAT is already recurrent so that choice is made for you here). You will probably want the sliding window length (ie. the number of timesteps in the past provided at every time step) to be roughly equal to your conservative guess of the largest number of previous timesteps required to gain enough information to split your waveform.
I'd say this is probably enough information to get you started.
When it comes to deciding how to provide the neural net with the data, you'll first want to normalise the input signals (consider a sigmoid function) and experiment with different transfer functions (sigmoid would probably be a good starting point).
I would imagine you'll want to have 2 output neurons, providing normalised amplitude (which you would denormalise via the inverse of the sigmoid function) as the output representing Source 1 and Source 2 respectively. For the fitness value (the way you judge the ability of each tested network to solve the problem) would be something along the lines of the negative of the RMS error of the output signal against the actual known signal (ie. tested against the samples I was referring to earlier that you will need to procure).
Suffice to say, this will not be a trivial operation, but it could work if you have enough samples to train the network against. What is a good number of samples? Well as a rule of thumb it's roughly a number that is large enough such that a simple polynomial function of order N (where N is the number of neurons in the netural network you require to solve the problem) cannot fit all of the samples accurately. This is basically to ensure you are not simply overfitting the problem, which is a serious issue with neural networks.
I hope this has been helpful! Best of luck.
Additional note: your work to date wouldn't have been in vain if you go down this route. A neural network is likely to benefit from additional "help" in the form of FFTs and other signal modelling "inputs", so you might want to consider taking the signal processing you have already done, organising into an analog, continuous representation and feeding it as an input alongside the input signal.

Signal processing: FFT overlap processing resources

Are there any good (if possible scientific) resources available (web or books) about overlap processing. I am not that interested in the effects of using overlap processing and windows when analyzing a signal, since the requirements are different. It is more about the following Real Time situation: (I am currently dealing with audio signals)
Dividing a signal into smaller parts.
Creating overlap windows.
FFTing the windowed chunks.
Do processing in the frequency domain.
IFFT the results.
put the chunks together to a continuous stream.
I am especially interested in the influence of the window used on the resulting error as well as the effect of the overlap length. However I couldn't find any good resources that deal with the subject in detail. Any suggestions?
Edit:
After some discussions if using a window function is appropriate, I found a decent handout explaining the overlap and add/save method. http://www.ece.tamu.edu/~deepa/ecen448/handouts/08c/10_Overlap_Save_Add_handouts.pdf
However, after doing some tests, I noticed that the windowed version would perform more accurate in most cases than the overlap & add/save method. Could anybody confirm this?
I don't want to jump to any conclusions regarding computation time though....
Edit2:
Here are some graphs from my tests:
I created a signal, which consists of three cosine waves
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: It can be seen that low frequencies are attenuated more than frequency in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like that. (sample 490 - 540)
It can be seen that the overlap add/save processes differ from the windowed version at the point where chunks are put together (sample 511). This is the error which leads to different results when comparing windowed process and overlap add/save. The windowed process is closer to the one processed in one big junk.
However, I have no idea why they are there or if they shouldn't be there at all.
This is fairly well-known area of signal processing, and generally speaking if you are doing processing along the lines of FFT -> spectral processing -> IFFT you need to use the "overlap and add" approach. Cross-correlation of two inputs is a classic example, done much more easily in the spectral domain than the time domain.
Here's a short paper I found right away via Google (I just searched for "fft overlap and add"): http://www.coe.montana.edu/ee/rmaher/ee477/ee477_fftlab_sp07.pdf
I would recommend you invest in a good Signal Processing book, such as the classic Rabiner & Gold "Theory and application of digital signal processing" (Prentice-Hall ISBN 0-13-914101-4). That should cover the concept of overlap-and-add processing.
When using an FFT for overlap-add or overlap-save fast convolution filtering, normally you don't want to use a windowing function. The circular windowing artifacts cancel out when combining successive FFT frames in canonical overlap add/save filtering.
ADDED:
If you do use a non-rectangular window, you might want to make sure that all the overlapped frames of windows sum to DC, otherwise your resulting filtered signal will have amplitude scalloping. Rectangular windows and raised-cosine (von Hann) windows will sum to DC if the overlap amount is an exact submultiple of the window width (except, of course, at the very start and end of the overlap sequence).
I have been playing with this attempting to answer the question for myself as to why one would use a window. My only references to a synthesis window are this:
https://ccrma.stanford.edu/~jos/sasp/Inverse_FFT_Synthesis.html
http://recherche.ircam.fr/anasyn/roebel/amt_audiosignale/VL2.pdf
http://www.dspdimension.com/tutorials/
Stephan Bernsee has some good overview information. His smbpitchshift code uses a synthesis window -- He uses the raised cosine on the input block, then applies it again on the output block, but this I believe is necessary because the pitch shifting algorithm is not a linear filtering operation, so it is certain there may be discontinuous artifacts on the window boundaries, thus a synthesis window is used to create a smooth transition between frames.
I think the reason there is not much information specifically addressing windowing for frequency domain real-time convolution is because it doesn't have a practical application unless you also need to do some analysis (ie, and adaptive filter of some sort), then the topics related to spectral spreading is again of interest.
I have plotted outputs from a filtered signal using both a raised cosine window as well as overlap-add method, and the end result is an identical IR, and identical signals. It comes as no surprise since the same operations performed in the time domain yield the same results.
On the other hand, if I implement a broken filter kernel, a smooth windowing function can help mask artifacts. This in a sense windows the broken IR so there is a more cohesive transition between frames. It would still be better to have an IR that is limited to length nfft/2 in the time domain. If you need to obtain a filter response with an IR longer than nfft/2, then you should consider either using a larger FFT size (if latency is not a problem) or use a partitioned convolution scheme:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB4QFjAA&url=http%3A%2F%2Fpcfarina.eng.unipr.it%2FPublic%2FPapers%2F164-Mohonk2001.PDF&ei=qtH0TorDEoKziQKAloHEDg&usg=AFQjCNGDmz79DiuG1kmPXifbWJ7M-gr9rQ&sig2=CMopEcGc1VArZ3gipWTr_w
or
http://www.music.miami.edu/programs/mue/Research/jvandekieft/jvchapter2.htm
I hope that is helpful to somebody reading this
I hope those links help, even though it doesn't directly address windowing as used in real-time Frequency domain filtering.

Resources