Converting a spectrogram image back to audio

I have generated some Mel-spectrograms using librosa to use for generative adversarial networks (GANs). I have saved the spectrograms generated by the GAN in image format (.png). Now I am trying to convert the images back to audio. Is this possible?

WaveNet can convert spectrograms back to speech audio; a PyTorch implementation is available here.

Related

Preparing image data to input into pre-built CNN

Hi everyone.
I am trying to create a CNN which, given an input image, classifies which part of the image to focus on. For that purpose, I collected data by recording human gaze on a given video and dividing each video frame into 9 different areas. With the actual gaze data acting as the supervisory signal, I am trying to make my system learn to mimic a human's eye gaze.
For starters, I am using a pre-built CNN for classification of the MNIST dataset using tensorflow. I am currently trying to make my dataset follow the format of the MNIST dataset (keras.datasets.mnist). I have video frames in .jpg format and the corresponding grid areas as a NumPy array.
I am stuck on how to correctly label and format my images so that I can feed them directly into the pre-built CNN. System: tensorflow 2.7.0, python 3.9.7 via conda.
Any help is much appreciated.
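One way to match the MNIST format is to load each frame into a NumPy array and stack them into `(N, H, W, 1)` / `(N,)` pairs, just like `keras.datasets.mnist` returns. A minimal sketch, assuming one 0–8 grid label per frame (the dummy frame written to a temp directory is only there so the sketch runs end to end; point `frame_dir` at your real frames instead):

```python
# Sketch: build MNIST-style (x_train, y_train) arrays from .jpg frames
# plus a NumPy label array, ready for a Keras CNN.
import glob
import os
import tempfile
import numpy as np
import tensorflow as tf

# Write one dummy frame so the sketch is self-contained; with real data,
# set frame_dir to the folder of extracted video frames.
frame_dir = tempfile.mkdtemp()
dummy = np.random.randint(0, 255, (120, 160, 3), dtype=np.uint8)
tf.io.write_file(os.path.join(frame_dir, "frame0001.jpg"),
                 tf.io.encode_jpeg(dummy))

frame_paths = sorted(glob.glob(os.path.join(frame_dir, "*.jpg")))
labels = np.array([4])  # grid cell 0-8 per frame (here: one dummy label)

def load_frame(path, size=(64, 64)):
    img = tf.io.read_file(path)
    img = tf.image.decode_jpeg(img, channels=1)   # grayscale, like MNIST
    img = tf.image.resize(img, size)
    return img.numpy().astype("float32") / 255.0

x_train = np.stack([load_frame(p) for p in frame_paths])  # (N, 64, 64, 1)
y_train = labels.astype("int64")                          # (N,)
print(x_train.shape, y_train.shape)
```

After this, `model.fit(x_train, y_train)` works exactly as in the MNIST example, with the final Dense layer sized to 9 classes instead of 10.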

Can I convert audio to MFCC as an RGB image and then use the image in a CNN for audio classification?

I am currently working on audio speech classification, and the lengths of my audio files vary between 5 seconds and 5 minutes. My question is: can I convert my audio to MFCCs as RGB images and then use a CNN with softmax? Does this sound like a good idea?
This sounds rather convoluted ;)
You can skip the RGB part, and just pass the MFCC directly to the CNN.

Applying a pytorch CNN to video?

I am looking for some advice on how to apply a pytorch CNN to a video as opposed to an image.
Picture a drone flying over an area and using video to capture some objects below. I have a CNN trained on images of objects, and want to count the objects in the video.
Currently my strategy has been to convert the video to frames as PNGs and run the CNN on those PNGs. This seems inefficient, and I am struggling with how to count the objects without duplicating them (frame 1 and frame 1+n will overlap).
It would be appreciated if someone had some advice, or a suggested tutorial/code set that did this. Thanks in advance.
PyTorch at the moment doesn't have built-in support to detect and track objects in a video.
You would need to create your own logic for that.
The support is limited to reading the video and audio from a file, reading frames and timestamps, and writing video; read more here.
What you will basically need to do is create object tracking: detect objects frame by frame, keep their bounding-box positions, and based on those positions decide whether a detection is the same object or a new one.
If you have a drone flying and inspecting people, you may check the Kinetics-trained video models to detect human actions:
ResNet 3D 18
ResNet MC 18
ResNet (2+1)D
All based on Kinetics-400
But the newer one is Kinetics-700.
Try using torchvision and torch to recognize objects in a YouTube video:
https://dida.do/blog/how-to-recognise-objects-in-videos-with-pytorch

Can datasets with different formats (.jpeg and .tif) be used together for training a CNN?

I currently have medical images from two sources. One is in JPEG format while the other is in TIF format. TIF is lossless while JPEG is lossy, so if I convert TIF to JPEG there is a chance of data loss. Can I instead mix both together and use them for training the CNN?
I am using Keras with the TensorFlow backend.
Neural networks, and machine learning models in general, do not take specific file formats as input, but expect matrices/tensors of real numbers. For RGB images this means a tensor with dimensions (height, width, 3). When the image is read from a file, it is transformed automatically into such a tensor, so it does not matter which file format you use.
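A minimal sketch of that point: Pillow reads both formats, and both end up as the same kind of float array, so the file format disappears before the network ever sees the data (the tiny in-memory images here stand in for your real files):

```python
# Sketch: a JPEG and a TIFF decode to identically shaped float arrays.
import io
import numpy as np
from PIL import Image

def make_demo(fmt):
    # create a tiny in-memory image saved in the given format
    buf = io.BytesIO()
    Image.fromarray(np.full((8, 8, 3), 128, dtype=np.uint8)).save(buf, format=fmt)
    buf.seek(0)
    return buf

for fmt in ("JPEG", "TIFF"):
    arr = np.asarray(Image.open(make_demo(fmt)).convert("RGB"),
                     dtype=np.float32) / 255.0
    print(fmt, arr.shape)   # both: (8, 8, 3)
```

The one thing worth normalizing is everything else: resize both sources to the same dimensions, use the same channel mode, and scale pixel values consistently.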

Convert audio stream into image

I'm working on a machine learning project, so I have TONS of data. My model trains on audio data; let's say it comes from mp3 or mp4.
Do you know any programs that do mass conversion from audio to some sort of image? Like a spectrogram?
I know there are programs that visualize individual files, but are there any that I can simply input a directory and it will convert all files?