I'm trying to use OpenSmile as feature extractor(using emobase2010.conf) and do some classification with that features.
What i'm curious about is whether if i can use stream of audio already made to list as input(I'm using ROS communication to get audio stream).
At manual of openSMILE it only has example of using .wav as input.
Or is there anyway to extract 1582 features(like emobase2010.conf) from audio other then using openSMILE?
Related
We are training our Azure Cognitive Services Custom Speech model using data recorded in .wav (RIFF) format at 16bit, 16kHz as per the documentation.
But, we have obtained a dataset of speech recorded at 48kHz and encoded as MP3. Speech Studio seems to be able to train the service using this data without problems but we would like to know if doing so, with the higher sample rate, will only be of use in recognising streamed data also at the higher rate or does that not matter?
Having a higher sample rate like the one you described is desirable in terms of quality of the audio, but it generally won't influence speech recognition. As long as you meet the audio format minimum requirements, speech recognition should work just fine.
I have a 2 dimensional data in the form of a text file. I have to build a GMM based on this data using Sidekit 1.2.
Which function should I use to estimate the parameters of the Gaussian model (Mean, covariance matrix, weighted average etc.)
Can you please provide a small example with your own set of (x,y) data and build a GMM using that ?
Any help would be greatly appreciated.
Sidekit is a toolkit built mainly for the task of speaker recognition, and its framework (as other similar toolkits) relies on the training data consisting of audio files in the formats .wav, .sph or raw PCM.
If you're just building a GMM and don't plan to use it for speaker recognition experiments, I would recommend using another toolkit for general statistical purposes (scikit-learn might be a good choice).
If you do plan to do speaker recognition tasks, you will have to some initial work on your data. If your text-data is some form of speaker data, you could convert it to the appropriate format. For example, if the y part is raw audio, convert it to wav-files. If y is cepstral features or other features, store it in h5.-format. After doing this, you can build a GMM for speaker recognition tasks by following the tutorials on the Sidekit homepage.
I am getting started with Google's Audioset. While the dataset is extensive, I find the information with regards to the audio feature extraction very vague. The website mentions
128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.
Within the paper, the authors discuss using mel spectrograms on 960 ms chunks to get a 96x64 representation. It is then unclear to me how they get to the 1x128 format representation used in the Audioset. Does anyone know more about this??
They use the 96*64 data as input for a modified VGG network.The last layer of VGG is FC-128, so its output will be 1*128, and that is the reason.
The architecture of VGG can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py
Is there a way to just detect in realtime if there are multiple people speaking? Do I need a voice recognition api for that?
I don't want to separate the audio and I don't want to transcribe it either. My approach would be to frequently record using one mic (-> mono) and then analyse those recordings. But how then would I detect und distinguish voices? I'd narrow it down by looking only at relevant frequencies, but then...
I do understand that this is no trivial undertaking. That's why I do hope there's an api out there capable of doing this out of the box - preferably an mobile/web-friendly api.
Now this might sound like a shopping list for Christmas but as mentioned I do not need to know anything about the content. So my guess is that a full fledged speech recognition would have a high toll on the performance.
Most of similar problems (adult/children classifier, speech/music classifier, single voice / voice mixture classifier) are standard machine learning problems. You can solve them with classifier like GMM. You only need to construct training data for your task, so:
Take some amount of clean recordings, you can download audiobook
Prepare mixed data by mixing clean recordings
Train GMM classifier on both
Compare probabilities from clean speech GMM and mixed speech GMM and decide the presence of mixture by ratio of probabilities from two classifiers.
You can find some code samples here:
https://github.com/littleowen/Conceptor
For example you can try
https://github.com/littleowen/Conceptor/blob/master/Gender.ipynb
I'm working on a machine learning project so I have TONS of data. My model is to train based on audio data. Let's say it comes from mp3 or mp4.
Do you know any programs that do mass conversion from audio to some sort of image? Like a spectrogram?
I know there are programs that visualize individual files, but are there any that I can simply input a directory and it will convert all files?