Convert audio stream into image

I'm working on a machine learning project, so I have TONS of data. My model will be trained on audio data; let's say it comes from mp3 or mp4 files.
Do you know of any programs that do mass conversion from audio to some sort of image, like a spectrogram?
I know there are programs that visualize individual files, but is there one where I can simply point it at a directory and it will convert all the files?
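One option, instead of hunting for a dedicated program, is a short librosa script that walks the directory itself. A minimal sketch, assuming ffmpeg is installed so librosa can decode mp3/mp4; the directory names and n_mels are placeholders to adapt:

    import os
    import numpy as np
    import librosa
    import librosa.display
    import matplotlib
    matplotlib.use("Agg")          # no display needed for batch jobs
    import matplotlib.pyplot as plt

    def convert_directory(in_dir, out_dir, n_mels=128):
        os.makedirs(out_dir, exist_ok=True)
        for name in os.listdir(in_dir):
            if not name.lower().endswith((".mp3", ".mp4", ".wav")):
                continue
            y, sr = librosa.load(os.path.join(in_dir, name), sr=None)  # decode audio
            S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
            S_db = librosa.power_to_db(S, ref=np.max)   # log scale for display
            fig, ax = plt.subplots()
            librosa.display.specshow(S_db, sr=sr, ax=ax)
            ax.set_axis_off()                           # image only, no axes
            out_path = os.path.join(out_dir, os.path.splitext(name)[0] + ".png")
            fig.savefig(out_path, bbox_inches="tight", pad_inches=0)
            plt.close(fig)

    convert_directory("audio_in", "spectrograms_out")

If the images are meant as model input rather than for viewing, consider saving the raw arrays with np.save instead of PNGs, which quantize the values to 8 bits.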

Related

Converting a spectrogram image back to audio

I have generated some Mel-spectrograms using librosa, to use them for generative adversarial networks (GANs). I have saved the spectrograms generated by the GAN in image format (.png). Now I am trying to convert the images back to audio. Is it possible?
WaveNet can convert spectrograms back to spoken audio; a PyTorch implementation is available here.
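If a full neural vocoder is overkill, a rougher alternative is Griffin-Lim reconstruction via librosa's mel_to_audio. The sketch below only works if the PNG stores the raw Mel values in grayscale (not through a colormap), and the assumed dB range, sr, n_fft and hop_length must match whatever generated the spectrogram:

    import numpy as np
    import librosa
    import soundfile as sf
    from PIL import Image

    # load the grayscale spectrogram image and undo the (assumed) scaling
    img = np.asarray(Image.open("generated.png").convert("L"), dtype=np.float32)
    img = img[::-1] / 255.0              # flip: row 0 of a PNG is the top, but
                                         # spectrogram plots put low freqs at the bottom
    S_db = img * 80.0 - 80.0             # assumed dB range [-80, 0]
    S = librosa.db_to_power(S_db)        # undo power_to_db
    y = librosa.feature.inverse.mel_to_audio(S, sr=22050, n_fft=2048,
                                             hop_length=512)  # Griffin-Lim under the hood
    sf.write("reconstructed.wav", y, 22050)

Expect audible artifacts: Griffin-Lim only estimates the phase, and quantizing the spectrogram to an 8-bit PNG loses information.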

Applying a pytorch CNN to video?

I am looking for some advice on how to apply a PyTorch CNN to video, as opposed to still images.
Picture a drone flying over an area and using video to capture some objects below. I have a CNN trained on images of objects, and want to count the objects in the video.
Currently my strategy has been to convert the video to frames saved as PNGs and run the CNN on those PNGs. This seems inefficient, and I am struggling with how to count the objects without duplicates (frame 1 and frame 1+n will overlap).
It would be appreciated if someone had some advice, or a suggested tutorial/code set that did this. Thanks in advance.
PyTorch at the moment doesn't have built-in support for detecting and tracking objects across a video; you would need to create your own logic for that.
Its video support is limited to reading the video and audio from a file, reading frames and timestamps, and writing video; read more here.
What you basically need to do is object tracking: match detections frame by frame by their bounding-box positions, and decide on that basis whether two detections are the same object, as in the sketch below.
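A minimal sketch of that approach, using a pretrained torchvision detector and naive IoU matching between consecutive processed frames; the stride, score threshold and IoU threshold are illustrative, not tuned:

    import torch
    import torchvision
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    def iou(a, b):
        # intersection-over-union of two [x1, y1, x2, y2] boxes
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-6)

    model = fasterrcnn_resnet50_fpn(pretrained=True).eval()
    frames, _, _ = torchvision.io.read_video("drone.mp4", pts_unit="sec")  # (T, H, W, C) uint8

    count, prev_boxes = 0, []
    for t in range(0, frames.shape[0], 5):          # stride: skip frames for speed
        img = frames[t].permute(2, 0, 1).float() / 255.0
        with torch.no_grad():
            det = model([img])[0]
        boxes = det["boxes"][det["scores"] > 0.5].tolist()
        # count a box as a new object only if it overlaps nothing
        # from the previous processed frame
        count += sum(all(iou(b, p) < 0.3 for p in prev_boxes) for b in boxes)
        prev_boxes = boxes
    print("approximate object count:", count)

This is the simplest possible tracker; for real use, look at established methods such as SORT, which add motion models and more robust matching.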
If your drone is flying over and filming people, you may want to check the Kinetics-pretrained video models for detecting human actions:
ResNet 3D 18
ResNet MC 18
ResNet (2+1)D
All of these are pretrained on Kinetics-400; the newer version of the dataset is Kinetics-700.
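Loading one of them is a one-liner; a small sketch (in newer torchvision releases, pretrained=True is replaced by a weights= argument):

    import torch
    from torchvision.models.video import r3d_18

    model = r3d_18(pretrained=True).eval()
    # dummy clip: batch x channels x frames x height x width
    clip = torch.rand(1, 3, 16, 112, 112)
    with torch.no_grad():
        logits = model(clip)              # scores over the 400 Kinetics classes
    print(logits.argmax(dim=1))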
Try using torchvision and torch to recognize objects in a YouTube video:
https://dida.do/blog/how-to-recognise-objects-in-videos-with-pytorch

Using openSMILE with audio stream

I'm trying to use openSMILE as a feature extractor (using emobase2010.conf) and do some classification with those features.
What I'm curious about is whether I can use an audio stream that has already been read into a list as input (I'm using ROS communication to get the audio stream).
The openSMILE manual only gives examples that use .wav files as input.
Alternatively, is there any way to extract the 1582 features (as in emobase2010.conf) from audio other than with openSMILE?
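One workaround is to buffer the stream, write it to a temporary WAV file, and run the SMILExtract command line on that file. A sketch, assuming the stream is a list of PCM samples and SMILExtract is on the PATH; the sample rate and paths are placeholders:

    import os
    import subprocess
    import tempfile
    import numpy as np
    import soundfile as sf

    def extract_features(samples, sample_rate=16000,
                         config="emobase2010.conf", out_path="features.arff"):
        audio = np.asarray(samples, dtype=np.float32)
        fd, wav_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            sf.write(wav_path, audio, sample_rate)        # buffered stream -> temp WAV
            subprocess.run(["SMILExtract", "-C", config,  # standard openSMILE CLI flags
                            "-I", wav_path, "-O", out_path], check=True)
        finally:
            os.remove(wav_path)
        return out_path

This adds a little disk I/O per chunk, but it keeps the exact emobase2010 feature set rather than approximating it with another toolkit.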

Pocketsphinx cannot decode mfc file while pocketsphinx_continuous decodes corresponding wav

I have been working with CMU Sphinx on Turkish speech-to-text for a couple of months. I have succeeded in training on 100 hours of audio. My goal was to use the resulting acoustic model with the Sphinx3 decoder; however, Sphinx3 cannot decode my test wav files. I then noticed that sphinxtrain runs pocketsphinx_batch at the end of training to test the model.
So I started working on pocketsphinx. I am at the point where pocketsphinx_batch cannot decode a wav file (it only produces "ııı", nothing else), while pocketsphinx_continuous produces more meaningful output from the same file (e.g. 10 correct words out of 15).
I guess I am missing some configuration steps. I have a compressed archive at this link,
which includes the acoustic and language models, the dictionary, and the wav files I am trying to decode.
I am asking for help getting my model working with Sphinx3 and pocketsphinx_batch.
Thank you.
Fortunately, I found the problem: it was the feature vectors produced by sphinx_fe. I had been creating them with the default values. After reading the make_feats.pl and sphinxtrain.cfg files, I created feature vectors compatible with the acoustic model. sphinxtrain.cfg sets the lifter parameter to 22, but running sphinx_fe with default values leaves the lifter at 0, i.e. no liftering. I regenerated the MFC files with a lifter value of 22, and then it worked.
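For reference, a sketch of regenerating the MFC files with matching parameters by calling sphinx_fe; the flag values are illustrative and should be copied from your own sphinxtrain.cfg (or the feat.params file shipped with the model), as should the file lists and directories:

    import subprocess

    subprocess.run([
        "sphinx_fe",
        "-argfile", "feat.params",   # feature params saved with the model, if present
        "-lifter", "22",             # must match the lifter used in training
        "-samprate", "16000",
        "-c", "test.fileids",        # control file listing the wavs to convert
        "-di", "wav", "-do", "mfc",  # input/output directories
        "-ei", "wav", "-eo", "mfc",  # input/output extensions
    ], check=True)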

Small Data training in CMU Sphinx

I have installed sphinxbase, sphinxtrain and pocketsphinx on Linux (Ubuntu). Now I am trying to train on the speech corpus, transcriptions, dictionary, etc. obtained from VoxForge (the data in my etc and wav folders comes from VoxForge).
As I am new to this, I just want to train on a few lines of transcript and a few wav files and get some results; say 10 wav files and the 10 transcript lines corresponding to them, like the person is doing in this video.
But when I run sphinxtrain, I get this error:
Estimated Total Hours Training: 0.07021431623931
This is a small amount of data, no comment at this time
I don't know what it means to set CFG_CD_TRAIN to "no".
What changes do I need to make to get past this error?
PS: I cannot add more data, because I want to see some results first to better understand the whole scenario.
There is not enough data for the training; with this little data, only CI (context-independent) models can be trained.
You need at least 30 minutes of audio data to train CI models. Alternatively, you can set CFG_CD_TRAIN to "no" so that context-dependent training is skipped.
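If you take that route, the setting lives in etc/sphinx_train.cfg (Perl syntax); the line to change looks like this:

    # etc/sphinx_train.cfg: skip context-dependent training so the run can
    # proceed with CI models only (accuracy will be limited)
    $CFG_CD_TRAIN = 'no';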
