Feeding real-time audio data to tensorflow on a mobile device - audio

I am building a prototype of a sound detection app that will ultimately run on a phone (iPhone/Android). It needs to be near real-time so that the user gets a fast response when a particular sound is recognized. I am hoping to use TensorFlow to build and train the model and then deploy it on a mobile device.
What I am unsure about is the best way to feed data to TensorFlow for inference in this case.
Option 1: Feed only newly acquired samples to the model.
Here the model itself keeps a buffer of previous signal samples, to which new samples are appended, and the whole thing gets processed.
Something like:
samples = tf.placeholder(tf.int16, shape=(None))
buffer = tf.Variable([], trainable=False, validate_shape=False, dtype=tf.int16)
update_buffer = tf.assign(buffer, tf.concat(0, [buffer, samples]), validate_shape=False)
detection_op = ....process buffer...
session.run([update_buffer, detection_op], feed_dict={samples: [.....]})
This seems to work, but if the samples are pushed to the model 100 times a second, what's happening inside tf.assign? The buffer can grow quite large, and if tf.assign allocates memory on every call this may not work well.
Option 2: Feed the whole recording to the model
Here the iPhone app keeps the state/recording samples, and feeds the whole recording to the model. The input can get quite large, and re-running the detection op on the whole recording will have to keep recomputing the same values each cycle.
Option 3: Feed a sliding window of data
Here the app keeps the data for the whole recording, but feeds only the latest slice of data to the model. E.g. the last 2 sec at a 2000 Hz sampling rate == 4000 samples, fed every 1/100 sec (after each 20 new samples). The model may also need to keep some running totals for the whole recording.
Advice?

I'd need to know a bit more about your application requirements, but for simplicity's sake I recommend starting with option #3. The usual way to approach this problem for arbitrary sounds is:
Have some trigger to detect the start of a sound or speech utterance. This can just be sustained audio levels, or something more advanced.
Run a spectrogram over a fixed size window, aligned with the start of the noise.
The rest of the network can just be a standard image detection one (usually cut down in size) to classify the sound.
There are a lot of variations and other possible approaches. For example for speech it's typical to use MFCC as your feature generator, and then run an LSTM to separate out phonemes, but since you mention sound detection I'm guessing you don't need anything this advanced.
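To make option #3 and the steps above a bit more concrete, here is a minimal sketch in plain NumPy (on the phone the same logic would live in the app code, with the model receiving the spectrogram). The names `SlidingWindow`, `is_triggered` and `spectrogram`, the 0.1 RMS threshold and the 256/128 frame sizes are all illustrative assumptions, and for simplicity the spectrogram is taken over the whole window rather than aligned with the start of the sound.

```python
import numpy as np

SAMPLE_RATE = 2000  # Hz, from the question's example
WINDOW = 4000       # last 2 seconds of audio
HOP = 20            # new samples per update (1/100 s)


class SlidingWindow:
    """Fixed-size buffer that always holds the most recent samples."""

    def __init__(self, size):
        self.buf = np.zeros(size, dtype=np.float32)

    def push(self, chunk):
        chunk = np.asarray(chunk, dtype=np.float32)
        n = len(chunk)  # assumed to be smaller than the buffer
        # Move old samples towards the front, then overwrite the tail.
        self.buf = np.roll(self.buf, -n)
        self.buf[-n:] = chunk
        return self.buf


def is_triggered(window, hop, threshold=0.1):
    """Crude trigger: RMS of the newest `hop` samples exceeds a threshold."""
    recent = window[-hop:]
    return float(np.sqrt(np.mean(recent ** 2))) > threshold


def spectrogram(window, frame=256, step=128):
    """Magnitude spectrogram from a Hann-windowed short-time FFT."""
    n_frames = 1 + (len(window) - frame) // step
    frames = np.stack([window[i * step:i * step + frame] for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
```

The app would call `push` for every 20 new samples and only run the (comparatively expensive) spectrogram and classifier when `is_triggered` returns True, so the input size stays constant no matter how long the recording gets.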

Related

Scaling an image according to audio (threshold, frequencies)

I am looking to scale a PNG image according to a provided audio track, a frequency range (20 Hz-1000 Hz for example) and a threshold, for a smooth effect.
For example, when there is a kick, the scale goes to 120% smoothly. I would like to make audio visualizers like those used for dubstep etc., where the image is "pumping" when the kicks come in.
First, is it doable with ffmpeg?
Where to start?
I found showcqt that takes frequencies in input etc., but its output is a video so I don't think I can use it in my case. Any help appreciated.
If you are able to read the PCM values as they are being output, then you might consider using a rolling RMS average in order to get a continuous stream of amplitudes. IDK the best length of the array. Perhaps it should correspond to the number of audio frames that would give you an update for each visual frame? The folks at the DSP site would have the best insights.
If you do a rolling average, computations are not terribly expensive. You'd do the square on the incoming and add that to a ring buffer (circular queue) and drop the outgoing. Only those data points need be added to the rolling average when computing the new rolling average, since the denominator is fixed and known. I found a video that describes the basic RMS math here using Matlab.
It might be necessary to add some smoothing to the visualizer that is receiving the volume updates. Also, handing off data from the audio thread should likely employ some form of loose coupling. It would not be good if the thread that is processing the audio was also handling graphics.
I'm a little over my head, but I think this is what is generally done for visualizers.
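For what it's worth, the rolling RMS and the smoothing described above could look roughly like this in Python. This is only a sketch: the names `RollingRMS` and `smooth_scale`, the threshold and the attack/release rates are made up for illustration, and it does not restrict the signal to a frequency range (that would need a band-pass filter in front of it).

```python
import numpy as np


class RollingRMS:
    """Rolling RMS over the last `size` samples using a ring buffer."""

    def __init__(self, size):
        self.squares = np.zeros(size)  # ring buffer of squared samples
        self.total = 0.0               # running sum of the buffer
        self.pos = 0                   # slot holding the oldest value

    def update(self, sample):
        sq = float(sample) ** 2
        # Add the incoming square and drop the outgoing one.
        self.total += sq - self.squares[self.pos]
        self.squares[self.pos] = sq
        self.pos = (self.pos + 1) % len(self.squares)
        # The denominator is fixed; clamp guards against rounding drift.
        return (max(self.total, 0.0) / len(self.squares)) ** 0.5


def smooth_scale(current, rms, threshold=0.2, max_scale=1.2, attack=0.5, release=0.1):
    """Move the image scale a fraction of the way towards its target."""
    target = max_scale if rms > threshold else 1.0
    rate = attack if target > current else release
    return current + rate * (target - current)
```

Calling `smooth_scale` once per video frame with the latest RMS value makes the image jump up quickly on a kick (attack) and settle back more slowly (release), which is the "pumping" look.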

I need to analyse many audio WAV files for characteristic noise, ideas?

I need to be able to analyze (search through) hundreds of WAV files and detect, but not remove, static noise. Currently I must listen to each conversation and find the characteristic noise/static manually, which takes too much time. Ideally, I would have a program that can read each new WAV file and detect characteristic signatures of the static noise, such as bursts of white noise or full-band, high-amplitude noise (like AM radio noise over a phone conversation, i.e. a wall of white noise) or bursts of peak high-frequency, high-amplitude noise (as in crackling on the phone line) against a background of normal voice. I do not need to remove the noise but simply detect it and flag the recording for further troubleshooting. Ideas?
I can listen to the recordings and find the static or crackling but this takes time. I need an automated or batch process that can run on its own and flag the troubled call recordings (WAV files from a phone PBX). These are SIP and analog conversations depending on the leg of the conversation so RTSP/SIP packet analysis might be an option, but the raw WAV file is the simplest. I can use Audacity, but this still requires opening each file and looking at the visual representation of the audio spectrum, which is only a little faster than listening to each call and still cumbersome.
I currently have no code or methods for this task. I simply listen to each call wav file to find the noise.
I need a batch WAV file search that can return the recordings that contain the characteristic noise, static or crackling over the recorded phone conversation.
Unless you can tell the program what the noise looks like, it's going to be challenging to run any sort of batch processing. I was facing a similar challenge, and that prompted me to develop (free and open source) software to help users with audio exploration, analysis and signal separation:
App: https://audioexplorer.online/
Docs: https://tracek.github.io/audio-explorer/
Source code: https://github.com/tracek/audio-explorer
Essentially, it visualises audio as a 2d scatter plot rather than only "linear", as in waveform or spectrogram. When you upload audio the following happens:
Onsets are detected (based on high-frequency content algorithm from aubio) according to the threshold you set. Set it to None if you want all.
For each audio fragment, audio features are calculated based on your selection. There's no universal best set of features; it all depends on the application. For starters you might try e.g. Pitch statistics. Consider setting proper values for the bandpass filter and sample length (that's the length of the audio fragment we're going to use). Sample length could in future be established dynamically. Check the docs for more info.
The result is that for each fragment you have many features, e.g. 6 or 60. That means we then have a k-dimensional structure (where k is the number of features), which we then project to 2d space with a dimensionality reduction algorithm of your selection. Uniform Manifold Approximation and Projection is a sound choice.
In theory, the resulting embedding should be such that similar sounds (according to the features we have selected) are close together, while different ones are further apart. Your noise should now be separated from your "not noise" and form a cluster.
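To give a feel for the "features per fragment, then project to 2d" idea outside the app, here is a rough NumPy-only sketch. It is not the app's code: the three features (RMS, spectral centroid, spectral flatness) are just an illustrative choice, fragments are assumed to be already cut to a fixed length instead of coming from onset detection, and plain PCA via SVD stands in for UMAP. The function names are hypothetical.

```python
import numpy as np


def fragment_features(fragment, sample_rate):
    """Return [rms, spectral centroid in Hz, spectral flatness] for one fragment."""
    windowed = fragment * np.hanning(len(fragment))
    power = (np.abs(np.fft.rfft(windowed)) + 1e-12) ** 2
    freqs = np.fft.rfftfreq(len(fragment), d=1.0 / sample_rate)
    rms = np.sqrt(np.mean(fragment ** 2))
    centroid = np.sum(freqs * power) / np.sum(power)
    # Geometric mean over arithmetic mean: near 0 for tones, higher for noise.
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return np.array([rms, centroid, flatness])


def project_2d(features):
    """Standardise the feature matrix and project it onto its first two principal axes."""
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return z @ vt[:2].T
```

White-noise-like fragments get a much higher flatness than voiced or tonal ones, so in the 2d projection they would be expected to end up in their own region, which is what you would then select with the lasso.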
When you hover over the graph, a set of icons appears in the upper-right corner. One is lasso selection. Use it to mark points, inspect the spectrogram and e.g. download a table with the features that describe that signal. At that point you can also reduce the noise (an extra button appears) in a similar way to Audacity: it analyses the spectrum and reduces these frequencies with some smoothing.
It does not completely solve your problem right now, but it could significantly cut the effort. Going through hundreds of WAVs could take the better part of a day, but you will be done. Want it automated? There's a CLI (command-line interface) that I am developing at the same time. In the not-too-distant future it should take what you have labelled as noise and signal and then use supervised machine learning to go through everything in batch mode.
Suggestions / feedback? Drop an issue on GitHub.

How should I handle large video datasets in Google Cloud ML Engine?

I am experimenting with video classification using Keras in Cloud ML Engine. My dataset consists of video sequences saved as separate images (e.g. seq1_frame1.png, seq1_frame2.png...) which I have uploaded to a GCS bucket.
I use a CSV file referencing the start and end frames of different subclips, and a generator which feeds batches of clips to the model. The generator is responsible for loading frames from the bucket, reading them as images, and concatenating them as numpy arrays.
My training is fairly long, and I suspect the generator is my bottleneck due to the numerous reading operations.
In the examples I found online, people usually save pre-formatted clips as TFRecord files directly to GCS. I feel like this solution isn't ideal for very large datasets as it implies duplicating the data, even more so if we decide to extract overlapping subclips.
Is there something wrong with my approach? And more importantly, is there a "gold standard" for using large video datasets for machine learning?
PS: I explained my setup for reference, but my question is not bound to Keras, generators or Cloud ML.
In this, you are almost always going to be trading time for space. You just have to work out which is more important.
In theory, for every frame, you have height*width*3 bytes. That's assuming 3 colour channels. One possible way you could save space is to use only one channel (probably choose green, or, better still, convert your complete dataset to greyscale). That would reduce your full size video data to one third size. Colour data in video tends to be at a lower resolution than luminance data so it might not affect your training, but it depends on your source files.
As you probably know, PNG is a lossless compressed image format. Every time you load one, the generator has to decompress it first and then concatenate it to the clip. You could save even more space by using a different compression codec, but that would mean every clip would need full decompression, which would probably add to your time. You're right, the repeated decompression will take time. And saving the video uncompressed will take up quite a lot of space. There are places you could save space, though:
reduce to greyscale (or green scale as above)
temporally subsample frames (do you need EVERY consecutive frame, or could you sample every second one?)
do you use whole frames or just patches? Can you crop or rescale the video sequences?
are you using optical flow? It's pretty processor intensive, consider it as a pre-processing step, too, so you only have to do it once per clip (again this is trading space for time)
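As a back-of-the-envelope illustration of the first two points, here is a small NumPy sketch. The function names are hypothetical, clips are assumed to be uint8 arrays of shape (frames, height, width, 3), and the greyscale weights are the common Rec. 601 luma coefficients rather than anything specific to your data.

```python
import numpy as np


def raw_bytes(frames, height, width, channels=3):
    """Uncompressed size of a clip at one byte per channel per pixel."""
    return frames * height * width * channels


def to_greyscale(clip):
    """(frames, height, width, 3) uint8 -> (frames, height, width) uint8."""
    weights = np.array([0.299, 0.587, 0.114])  # Rec. 601 luma weights
    return np.round(clip @ weights).astype(np.uint8)


def subsample(clip, every=2):
    """Keep every `every`-th frame of the clip."""
    return clip[::every]
```

Greyscale alone cuts the raw size to a third, and keeping every second frame halves it again, so the two together store a clip in one sixth of the original bytes (before any compression).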

Is there a standard way to load/process (audio) data dynamically in tensorflow?

I'm building a network using the NSynth dataset. It has some 22 GB of data. Right now I'm loading everything into RAM but this presents some (obvious) problems.
This is an audio dataset and I want to window the signals and produce more examples, changing the hop size for example, but because I don't have infinite amounts of RAM there are very few things I can do before I run out of it (I'm actually only working with a very small subset of the dataset, don't tell Google how I live).
Here's some code I'm using right now:
Code:
import os
import time

import librosa
import numpy as np

# trim_silence, windower and StrechableNumpyArray are helpers not shown here.

def generate_audio_input(audio_signal, window_size):
    # Trim leading/trailing silence, then cut the signal into overlapping windows.
    audio_without_silence_at_beginning_and_end = trim_silence(audio_signal, frame_length=window_size)
    splited_audio = windower(audio_without_silence_at_beginning_and_end, window_size, hop_size=2048)
    return splited_audio

start = time.time()
audios = StrechableNumpyArray()
window_size = 5120
pathToDatasetFolder = 'audio/test'
time_per_loaded = []
time_to_HD = []
for file_name in os.listdir(pathToDatasetFolder):
    if file_name.endswith('.wav'):
        now = time.time()
        audio, sr = librosa.load(pathToDatasetFolder + '/' + file_name, sr=None)
        time_to_HD.append(time.time()-now)
        output = generate_audio_input(audio, window_size)
        audios.append(np.reshape(output, (-1)))
        time_per_loaded.append(time.time()-now)
audios = audios.finalize()
# One row per window, then shuffle the rows.
audios = np.reshape(audios, (-1, window_size))
np.random.shuffle(audios)
end = time.time()-start
print("wow, that took", end, "seconds... might want to change that to mins :)")
print("On average, it took", np.average(time_per_loaded), "per loaded file")
print("With an standard deviation of", np.std(time_per_loaded))
I'm thinking I could load only the filenames, shuffle those and then yield X loaded results for a more dynamic approach, but in that case I will still have all the different windows for a sound inside those X loaded results, which doesn't give me very good randomization.
I've also looked into TFRecords but I don't think that would improve anything from what I propose in the last paragraph.
So, to the clear question: Is there a standard way to load/process (audio) data dynamically in tensorflow?
I would appreciate it if the response is tailored to the particular problem I'm addressing of pre-processing my dataset before starting training.
I would also accept it if the answer is to pre-process the data, save it into a TFRecord and then load the TFRecord, but I think that's sort of overkill.
After discussing with some colleagues during the last few months, I now think that the standard is indeed to use TFRecords. After making a few and understanding how to work with them I found several advantages and some drawbacks when using them with audio.
Advantages:
They completely solve all enqueuing issues with minimal strain on RAM.
There are solutions to load examples randomly. How many examples you load on RAM will depend on how frequently you want to go to the HD and how much information you want to load each time you access it.
They are easy to share and the pre-processing is (usually) already incorporated. You can have several processes using them, or several people across different continents, with the certainty that you are all using the same data. This is not true when working with raw audio and processing it on the fly, as different software may apply computations differently (e.g. stft implementations may change soon).
Drawbacks:
They are too static. If you want to change your dataset in any way you need to create a new one. There is no way to modify every or any example. E.g., after a few iterations I decided to discard tensors with low amplitude. I could handle that in the code after loading a batch, but the only sensible way would be to discard the whole batch every time I found an outlier.
Creating them is a cumbersome and slow process. There is no way to start working with a TFRecord until it's complete. Additionally, if you decide to change the size of the tensors or the data type, you're going to have to make extra changes to your code and test them as some errors (e.g. data types) just pass silently.
Large on HD. Because TFRecords contain examples that are fed directly into your network, they are not equivalent to the raw audio files and you cannot erase those. And because some of the examples in the TFRecord are the product of data-augmentation techniques, they tend to be larger than the original files. (This last one is probably just a normal consequence of working with big datasets.)
All in all, I think even though they are not tailored for audio and they are not very easy to implement at first, they are quite convenient and useful, which is probably the reason why most of the people I've asked who work with big datasets said they use them.
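On the randomization worry from the question (all windows of one sound ending up next to each other), the usual trick is a shuffle buffer, which is also the idea behind `tf.data.Dataset.shuffle(buffer_size)`. Below is a minimal pure-Python sketch of that idea, independent of TFRecords. The name `shuffled_windows` and the `load_windows` callback (anything that turns a file name into an iterable of windows) are hypothetical.

```python
import random


def shuffled_windows(file_names, load_windows, buffer_size=1000, seed=None):
    """Yield windows from many files in approximately random order.

    Only `buffer_size` windows are held in memory at any time.
    """
    rng = random.Random(seed)
    names = list(file_names)
    rng.shuffle(names)  # shuffle at the file level first
    buffer = []
    for name in names:
        for window in load_windows(name):
            if len(buffer) < buffer_size:
                buffer.append(window)  # still filling the buffer
                continue
            # Emit a random buffered window and put the new one in its slot.
            i = rng.randrange(buffer_size)
            yield buffer[i]
            buffer[i] = window
    # Drain whatever is left once all files have been read.
    rng.shuffle(buffer)
    yield from buffer
```

The buffer size is the RAM versus randomness knob: with a buffer several times larger than the number of windows per file, windows from different sounds get mixed well, while a buffer of size 1 degenerates to reading the files in order.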

Changing duration of Guitar pluck in PCM data

Folks,
I am struggling with a simple concept related to the duration of play of PCM data. I would appreciate your feedback.
The application I am developing plays guitar notes from a music sheet.
I have implemented Jaffe-Smith Algorithm for guitar plucking.
https://ccrma.stanford.edu/~jos/Mohonk05/Extended_Karplus_Strong_EKS_Algorithm.html.
Let's say I compute samples for note A (440 Hz) for one second.
At the sample rate of 11025, I will be storing 11025 samples that can be sent to the computer speakers as PCM audio.
For all the unique notes on the guitar, it takes quite some time to compute samples for all the notes. I am thinking I will pre-compute and save them as binary data and simply load them when the application is run.
So far so good.
Now, let's say I want to play a song (a list of various notes). Let's say the song needs to be played at 100 beats per minute. Let's say I have to play note A for one beat or 0.6 seconds (60/100).
Recalculating samples for 0.6 seconds may take quite some time.
Can I simply play (11025 * 0.6) samples? Will this create any side effect?
Is there a better way to achieve what I am trying to do?
Thank you in advance for your help.
Regards,
Peter
What you're basically trying to do is create a synthesized guitar, yes? I might suggest that you go with the sampler route instead.
By sample, I mean a small clip of audio (not a single sample in the sense of ADC or DAC).
Basically, you can flatten what you need into 4 parts:
Attack
Decay
Sustain
Release
These four parts work in that order, and are generally referred to as an ADSR envelope. The attack of the note is the initial sound. For a guitar, you are going to hear a pluck and the start of a pitch. The decay is going to be the sample of the string as it starts to fade away. The sustain is a sample repeated over and over again until you release the key. The release sample is what is played when you release the key. For a guitar, you might hear a sample of lightly putting fingers back on the string to stop their vibration.
Now, you could generate all of these samples in real-time, but that will likely be very CPU intensive.
Regarding your question: "Can I simply play (11025 * 0.6) samples?" Yes, at a sample rate of 11025, that will be 0.6 seconds of audio. Also keep in mind though that you should be sending a continuous stream of data to the sound card, filling any empty spots with 0 (for signed PCM).
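A small sketch of the "just play the first 0.6 seconds" approach, assuming the pre-computed note is a NumPy float array. The function name and the 5 ms fade are my own assumptions: cutting a waveform abruptly at a non-zero value can produce an audible click, so the sketch ramps the last few milliseconds down to zero, and it pads with zeros when the requested duration is longer than the stored note.

```python
import numpy as np

SAMPLE_RATE = 11025  # Hz, as in the question


def note_for_duration(samples, seconds, fade_ms=5.0, sample_rate=SAMPLE_RATE):
    """Cut (or zero-pad) a pre-computed note to `seconds`, fading out the cut."""
    n = int(round(seconds * sample_rate))
    out = np.zeros(n, dtype=np.float32)  # silence is 0 for signed PCM
    k = min(n, len(samples))             # samples actually copied
    out[:k] = samples[:k]
    # Short linear fade over the last few milliseconds of the copied part.
    fade = min(k, int(sample_rate * fade_ms / 1000.0))
    if fade > 0:
        out[k - fade:k] *= np.linspace(1.0, 0.0, fade, dtype=np.float32)
    return out
```

For one beat at 100 beats per minute, `note_for_duration(note_a, 60 / 100)` returns 6615 samples, and consecutive notes can simply be concatenated into one continuous stream for the sound card.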
