I am using the LJ Speech dataset from Hugging Face for automatic speech recognition (ASR) training.
Link to dataset: https://huggingface.co/datasets/lj_speech
The sampling rate of the audio is 22050 Hz.
I want to convert the whole dataset to 16000 Hz.
Code and output:

lj_data['audio'][0]

Output: (screenshot of the audio file description, showing sampling_rate: 22050)
Actually, I found the answer.
Hugging Face has some handy functions that can resample the audio files.
from datasets import load_dataset, Audio

# load the LJ Speech dataset
data = load_dataset("lj_speech")

# resample the training split from 22050 Hz to 16000 Hz
data['train'] = data['train'].cast_column("audio", Audio(sampling_rate=16_000))
See the documentation: https://huggingface.co/docs/datasets/audio_process.html
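As a side note, cast_column can also be called on the whole DatasetDict rather than on a single split, which is convenient when a dataset has more than one split; a minimal sketch using the same objects as above:

from datasets import load_dataset, Audio

# cast the audio column for every split in one call
data = load_dataset("lj_speech")
data = data.cast_column("audio", Audio(sampling_rate=16_000))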
Results:
Before resampling: sampling_rate = 22050 Hz
After resampling: sampling_rate = 16000 Hz
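A quick way to verify that the cast took effect (assuming the code above has run; the audio is decoded and resampled lazily when the column is accessed):

sample = data['train'][0]['audio']
print(sample['sampling_rate'])   # 16000
print(sample['array'].shape)     # waveform resampled on the fly to 16 kHz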
Related
I have generated some mel-spectrograms using librosa for use with generative adversarial networks (GANs). I have saved the spectrograms generated by the GAN in image format (.png). Now I am trying to convert the images back to audio. Is that possible?
WaveNet can convert spectrograms back to speech audio; a PyTorch implementation is available here.
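If a neural vocoder like WaveNet is more than you need, a rough alternative is Griffin-Lim via librosa's inverse mel transform. The sketch below is only an approximation: it assumes you know the parameters (sample rate, n_fft, hop length, dB scaling) the spectrograms were generated with, since a .png stores pixel values rather than the original magnitudes, and the file names are placeholders.

import numpy as np
import librosa
import soundfile as sf
from PIL import Image

# assumed parameters -- these must match how the spectrograms were created
SR = 22050
N_FFT = 1024
HOP = 256
TOP_DB = 80.0

# load the PNG as grayscale; depending on how it was saved you may also need np.flipud
img = np.array(Image.open("mel_spec.png").convert("L"), dtype=np.float32)

# map pixel values back to decibels (assumes a linear [-TOP_DB, 0] dB -> [0, 255] mapping)
mel_db = img / 255.0 * TOP_DB - TOP_DB

# dB -> power, then invert the mel spectrogram with Griffin-Lim
mel_power = librosa.db_to_power(mel_db)
audio = librosa.feature.inverse.mel_to_audio(mel_power, sr=SR, n_fft=N_FFT, hop_length=HOP)
sf.write("reconstructed.wav", audio, SR)

The result will sound noticeably worse than a neural vocoder, but it is a quick sanity check that the images still contain usable spectral information.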
I am currently working on audio speech classification, and the lengths of my audio clips vary between 5 seconds and 5 minutes. My question is: can I convert my audio to MFCCs rendered as RGB images and then use a CNN with softmax? Does this sound like a good idea?
This sounds rather convoluted ;)
You can skip the RGB part and just pass the MFCCs directly to the CNN.
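For illustration, a minimal sketch of that idea with librosa and PyTorch (the file name, layer sizes and number of classes are hypothetical): compute MFCCs, add a channel axis, and feed the 2D array straight to a small CNN; global pooling also sidesteps the variable clip lengths.

import librosa
import torch
import torch.nn as nn

# load a clip and compute MFCCs: shape (n_mfcc, n_frames)
y, sr = librosa.load("clip.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# add batch and channel axes -> (1, 1, n_mfcc, n_frames): a single-channel "image"
x = torch.from_numpy(mfcc).float().unsqueeze(0).unsqueeze(0)

# tiny illustrative CNN; AdaptiveAvgPool2d handles clips of any length
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),   # 10 = hypothetical number of classes
)
logits = model(x)        # apply softmax via nn.CrossEntropyLoss during training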
We are training our Azure Cognitive Services Custom Speech model using data recorded in .wav (RIFF) format at 16-bit, 16 kHz, as per the documentation.
However, we have obtained a dataset of speech recorded at 48 kHz and encoded as MP3. Speech Studio seems to be able to train the service with this data without problems, but we would like to know: will the higher sample rate only be of use when recognising streamed data at that same higher rate, or does that not matter?
Having a higher sample rate like the one you described is desirable in terms of audio quality, but it generally won't influence speech recognition. As long as you meet the audio format minimum requirements, speech recognition should work just fine.
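If you ever want to bring such recordings down to the documented format yourself before uploading, here is a minimal sketch using librosa and soundfile (file names are placeholders, and decoding MP3 assumes an ffmpeg/audioread backend is installed):

import librosa
import soundfile as sf

# decode the MP3 and resample from 48 kHz to 16 kHz, mono
audio, sr = librosa.load("speech_48k.mp3", sr=16000, mono=True)

# write a 16-bit PCM RIFF/WAV file, matching the documented Custom Speech format
sf.write("speech_16k.wav", audio, sr, subtype="PCM_16")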
I am getting started with Google's AudioSet. While the dataset is extensive, I find the information regarding the audio feature extraction rather vague. The website mentions:
128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
Within the paper, the authors discuss using mel spectrograms of 960 ms chunks to get a 96x64 representation. It is then unclear to me how they get to the 1x128 representation used in AudioSet. Does anyone know more about this?
They use the 96x64 patches as input to a modified VGG network. The last layer of that network is FC-128, so its output for each patch is 1x128, and that is where the 128-dimensional features come from.
The architecture of VGG can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py
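To make the shape bookkeeping concrete, here is a sketch with random placeholder arrays (not the real VGGish weights) of how a 10-second clip ends up as ten 128-dimensional vectors, i.e. one per second:

import numpy as np

SECONDS = 10

# each ~960 ms chunk becomes a 96x64 log-mel patch (96 time frames x 64 mel bands)
patches = np.random.rand(SECONDS, 96, 64)

# stand-in for the VGG-style network: its final FC-128 layer maps each patch to 128 values
def fake_vggish(patch):
    return np.random.rand(128)

embeddings = np.stack([fake_vggish(p) for p in patches])   # shape (10, 128)

# AudioSet then applies PCA (still 128-d) and quantizes the values to 8 bits,
# so the released features are one uint8 vector of length 128 per second of audio
quantized = np.clip(embeddings * 255, 0, 255).astype(np.uint8)
print(quantized.shape)   # (10, 128)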
I'm able to get a reasonable level of accuracy with 8khz audio files. Now I want to try a higher sample rate, if I can.
Looking at the acoustic models available on this page, they list:
en-us-8khz.tar.gz
en-us-semi-full.tar.gz
en-us-semi.tar.gz
en-us.tar.gz
The one that says 8khz is obviously the one for the 8 kHz sample rate, but what about the other three? What sample rates are they for?
If I use a 16 kHz audio file, which of these acoustic models do I need to use?
And in the absence of the sample rate being in the file name, how do I figure out the sample rate of an acoustic model?
You can open the feat.params file in the model folder and look for the -upperf parameter. In an 8 kHz model, -upperf is usually 3500 or 4000. For a 16 kHz model, -upperf is more than 4000, usually 6800.
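A small helper to read that value programmatically (a sketch; it assumes feat.params uses the usual one "-name value" pair per line, and the model path is hypothetical):

def get_upperf(model_dir):
    # read the -upperf setting from the model's feat.params file
    with open(f"{model_dir}/feat.params") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2 and parts[0] == "-upperf":
                return float(parts[1])
    return None

upperf = get_upperf("en-us")   # path to an unpacked acoustic model directory
# rule of thumb from above: <= 4000 suggests an 8 kHz model, ~6800 a 16 kHz model
print("16 kHz model" if upperf and upperf > 4000 else "8 kHz model")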