I am applying deep learning algorithms to the Speech Commands dataset.
I am curious whether the audio needs to be normalized before turning the clips into spectrograms, or before any other feature engineering.
I've gone through some notebooks on GitHub that use this dataset and haven't found any clues, but since we are using neural networks I think some normalization is needed.
I have never worked with audio data, so I am not very experienced.
Yes, normalizing the data is recommended for neural network training.
There is a good explanation here: https://stats.stackexchange.com/q/458579/131706.
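As a minimal sketch (not taken from the linked answer, and the file name is just a placeholder), one common approach is to peak-normalize each clip before computing the spectrogram and then standardize the log-mel features:

import numpy as np
import librosa

y, sr = librosa.load("speech_command.wav", sr=16000)   # hypothetical clip from the dataset
y = y / (np.max(np.abs(y)) + 1e-9)                     # peak-normalize the waveform to [-1, 1]

S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S, ref=np.max)
S_norm = (S_db - S_db.mean()) / (S_db.std() + 1e-8)    # zero-mean, unit-variance network input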
I want to train and use an ML-based personal voice-to-text converter for a highly impaired voice, for a small set of 300-400 words. This is to be used by people with voice impairment. But it cannot be generic, because each person will have a unique voice input for words, depending on their type of impairment.
I wanted to know if there are any ML engines which allow for such training. If not, what is the best approach to go about it?
Thanks
Most speech recognition engines support training (wav2letter, DeepSpeech, ESPnet, Kaldi, etc.); you just need to feed in the data. The only issue is that you need a lot of data to train reliably (on the order of 1000 samples for each word). You can look at the Google Speech Commands dataset for an example of how to train from scratch.
Since the training dataset will be pretty small in your case and will consist of just a few samples, you can probably start with an existing pretrained model and fine-tune it on your samples to get the best accuracy. You should look into "few-shot learning" setups.
You can probably look at a wav2vec 2.0 pretrained model; it should be effective for such learning. You can find examples and commands for fine-tuning and inference here.
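As a rough sketch of how the wav2vec 2.0 route can look with the Hugging Face transformers library (the checkpoint name and audio file below are just placeholders, and real fine-tuning would need a CTC training loop on top of this):

import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Load a pretrained wav2vec 2.0 CTC model; for a personal vocabulary you would
# fine-tune this checkpoint on the user's own recordings.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech, sample_rate = sf.read("word_sample.wav")  # 16 kHz mono recording (placeholder path)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits    # (batch, frames, vocab)
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids))      # greedy CTC decoding of the utterance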
You can also try fine-tuning Jasper models on Google Speech Commands with NVIDIA NeMo. It might be a little less effective, but it could still work and should be easier to set up.
I highly recommend watching the YouTube Originals series "The Age of A.I.", season one, episode two.
Basically, Google has already done this for people who can't form normal words because of an impaired voice. It is very interesting and talks a little about how they did, and are doing, that with ML technologies.
I am looking to understand the various spectrograms for audio analysis. I want to convert an audio file into 10-second chunks, generate spectrograms for each, and train a CNN model on top of those images to see if they are good or bad.
I have looked at linear, log, mel, etc., and read somewhere that a mel-based spectrogram is best to use for this, but with no proper verifiable information. I have used the following simple code to generate a mel spectrogram.
import numpy as np
import librosa
import librosa.display

y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is: which spectrogram best represents the features of an audio file for training with a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look almost the same.
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and mel spectrograms. So mel spectrograms seem to be the clear winner for dimensionality reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa

y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.core.power_to_db(S)
As for normalization after dB-scaling, that seems hit or miss depending on your data. In the paper above, the authors found almost no difference between the various normalization techniques they tried on their data.
One last thing that should be mentioned is a somewhat newer method called Per-Channel Energy Normalization (PCEN). I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel spectrograms. You can implement it in Librosa like this:
import librosa

y, sr = librosa.core.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel spectrograms are the current "standard" for use with convolutional neural networks. They were the most commonly used input in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
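As a minimal sketch of that kind of per-clip normalization (the file name is just a placeholder):

import numpy as np
import librosa

y, sr = librosa.load("clip.wav")                      # placeholder audio clip
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S, ref=np.max)             # log-scaled mel spectrogram

# Per-clip mean/std normalization before feeding the CNN
S_norm = (S_db - S_db.mean()) / (S_db.std() + 1e-8)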
But from the perspective of a CNN, there is relatively little difference between the different spectrogram variants, so this is unlikely to fix your issue if two or more spectrograms look basically the same.
The primary objective (my assigned work) is to do image segmentation of underwater images using a convolutional neural network. The camera shots taken of the underwater structure will have poor image quality due to severe noise and bad light exposure. In order to achieve higher classification accuracy, I want to do automatic image enhancement of the images (see the attached file). So, I want to know which CNN architecture will be best to do both tasks. Please kindly suggest any possible solutions to achieve the objective.
What do you need to segment? It would be nice to see some labels of the segmentation.
You may not need to enhance the images; if your whole dataset has the same amount of noise, the network will generalize properly.
Regarding CNN architectures, it depends on the constraints you have on processing power and accuracy. If that is not a constraint, go with something like Mask R-CNN; check that repo as a good starting point.
Be mindful it's a bit of a complex architecture, so inference times might be a bit high (but real time is doable depending on your GPU).
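If you end up in PyTorch, a minimal sketch of running a pretrained Mask R-CNN from torchvision (assuming a recent torchvision version) looks like this; the input tensor below is just a stand-in for a real underwater frame, and you would fine-tune on your own labels:

import torch
import torchvision

# Pretrained Mask R-CNN (COCO weights); for underwater structures you would
# replace the heads and fine-tune on your own segmentation labels.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)              # stand-in frame, RGB values in [0, 1]
with torch.no_grad():
    pred = model([image])[0]                 # dict with 'boxes', 'labels', 'scores', 'masks'
print(pred["masks"].shape)                   # (num_detections, 1, 480, 640)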
Other, simpler architectures are FCNs (Fully Convolutional Networks), which are basically your CNN but with the fully connected layers replaced by fully convolutional layers.
The advantage of these FCNs is that they are really easy to implement and modify, since you can go from simple architectures (FCN-AlexNet) to more complex and more accurate ones (FCN-VGG, FCN-ResNet).
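As a rough sketch of the idea (a toy network, not FCN-AlexNet itself): a small convolutional backbone, a 1x1 convolution where the fully connected layers would normally go, and upsampling back to the input resolution:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 1x1 convolution takes the place of the fully connected classifier
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        x = self.head(self.backbone(x))
        # Upsample the per-pixel class scores back to the input size
        return F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)

logits = TinyFCN(num_classes=2)(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 2, 128, 128])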
Also, you don't mention a framework. There are many to choose from, and it depends on your familiarity with languages; most of them can be used from Python:
- TensorFlow
- PyTorch
- MXNet
But if you are a beginner, try starting with a GUI-based one. NVIDIA DIGITS is a great starting point and really easy to configure; it's based on Caffe, so it's fairly fast when deploying and can easily be integrated with accelerators like TensorRT.
What type of neural net architecture would one use to map sounds to other sounds? Neural nets are great at learning to go from sequences to other sequences, so sound augmentation/generation seems like it would be a very popular application of them (but unfortunately, it's not; I could only find a fairly old Magenta project dealing with it and about two other blog posts).
Assuming I have a sufficiently large dataset of input sounds / output sounds of the same length, how would I format the data? Perhaps train a CNN on spectrograms (something like CycleGAN or pix2pix), or maybe use the raw data from the WAV file with an LSTM? Is there some other type of weird architecture no one has heard about that's good for sound? Help me out please!
To anyone else doing a similar thing: the answer is to use Fast Fourier Transforms to get the data into a manageable state, and then people usually use RNNs or LSTMs to do stuff with the data, not CNNs.
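As a minimal sketch of that pipeline (the file name is a placeholder, and the sequence model in the middle is left out): a short-time Fourier transform turns the waveform into frames a recurrent network can consume, and the inverse transform turns processed frames back into audio:

import numpy as np
import librosa

y, sr = librosa.load("input.wav", sr=None)               # placeholder input sound
D = librosa.stft(y, n_fft=1024, hop_length=256)          # complex spectrogram: (freq_bins, frames)
mag, phase = np.abs(D), np.angle(D)                      # magnitude frames are the usual RNN/LSTM input

# ... an RNN/LSTM would transform `mag` frame by frame here ...

y_out = librosa.istft(mag * np.exp(1j * phase), hop_length=256)  # back to a waveform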
I am planning on making an AI song composer that would take in a bunch of songs of one instrument, extract musical notes (like ABCDEFG) and certain features from the sound wave, perform machine learning (most likely with recurrent neural networks), and output a sequence of ABCDEFG notes (i.e., generate its own songs / music).
I think that this would be an unsupervised learning problem, but I am not really sure.
I figured that I would use recurrent neural networks, but I have a few questions on how to approach this:
- What features should I extract from the sound wave so that the output music is melodious?
I also have a few other questions:
- Is it possible, with recurrent neural networks, to output a vector of sequenced musical notes (ABCDEFG)?
- Is there a smart way to feed in the features of the sound waves as well as the sequence of musical notes?
Well, I did something similar once (making a Shazam-like app in MATLAB). I think you can use the FFT (Fast Fourier Transform) to break the signal down into its constituent frequencies and their corresponding amplitudes. Then you can use the frequency range of different instruments to pick them out of the whole mix and classify them.
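A minimal sketch of that first step (with a synthetic tone standing in for a real recording): the FFT gives you the constituent frequencies and their amplitudes, from which you can map peaks to notes or instrument ranges:

import numpy as np

sr = 22050                                   # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / sr)
signal = np.sin(2 * np.pi * 440.0 * t)       # stand-in for a real recording (an A4 tone)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
amplitudes = np.abs(spectrum) / len(signal)

print(freqs[np.argmax(amplitudes)])          # ~440.0 Hz, the dominant frequency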
I already tried something similar with an RNN (recurrent neural network). Try using an LSTM network (Long Short-Term Memory); they are WAY better than plain RNNs for this type of data processing, from what I read afterward, because they do not suffer from the "vanishing gradient problem".
What Chris Thaliyath said is a good hint on how to train the feature detector.
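As a rough sketch of the LSTM idea in PyTorch (the sizes and the note encoding are just illustrative assumptions): an LSTM that reads a sequence of note IDs and predicts the next note at each step:

import torch
import torch.nn as nn

class NoteLSTM(nn.Module):
    def __init__(self, num_notes=7, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_notes, embed_dim)   # notes encoded as IDs, e.g. A=0 ... G=6
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_notes)

    def forward(self, note_ids):                          # note_ids: (batch, seq_len)
        x = self.embed(note_ids)
        h, _ = self.lstm(x)
        return self.out(h)                                # next-note logits at each step

logits = NoteLSTM()(torch.randint(0, 7, (2, 32)))
print(logits.shape)                                       # torch.Size([2, 32, 7])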