Applying a pytorch CNN to video? - pytorch

I am looking for some advice on how to apply a pytorch CNN to a video as opposed to an image.
Picture a drone flying over an area and using video to capture some objects below. I have a CNN trained on images of objects, and want to count the objects in the video.
Currently, my strategy has been to convert the video to PNG frames and run the CNN on those PNGs. This seems inefficient, and I am struggling with how to count the objects without double counting (frame 1 and frame 1+n will overlap).
It would be appreciated if someone had some advice, or a suggested tutorial/code set that did this. Thanks in advance.

PyTorch does not currently have built-in support for detecting and tracking objects in a video.
You would need to write your own logic for that.
The built-in support (in torchvision) is limited to reading video and audio from a file, reading frames and timestamps, and writing video.
What you basically need to do is track objects from frame to frame: keep each detection's bounding-box position and, based on how those positions line up between consecutive frames, decide whether it is the same object or a new one.
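The frame-by-frame matching described above can be sketched roughly as follows. This is only a minimal illustration, not a full tracker: it assumes your CNN already gives you one list of bounding boxes per frame in (x1, y1, x2, y2) pixel coordinates, and the names `iou`, `count_unique_objects` and the `iou_threshold` value are made up for this example. Boxes in consecutive frames that overlap enough (intersection over union) are treated as the same object.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_unique_objects(frames: List[List[Box]], iou_threshold: float = 0.3) -> int:
    """Count objects across frames, treating sufficiently overlapping boxes
    in consecutive frames as the same object (greedy matching)."""
    next_id = 0
    previous: Dict[int, Box] = {}  # track id -> box in the previous frame
    for boxes in frames:
        current: Dict[int, Box] = {}
        unmatched = dict(previous)  # tracks not yet claimed in this frame
        for box in boxes:
            best_id, best_iou = None, iou_threshold
            for track_id, prev_box in unmatched.items():
                score = iou(box, prev_box)
                if score >= best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                best_id = next_id  # no overlap with any known track: new object
                next_id += 1
            else:
                del unmatched[best_id]
            current[best_id] = box
        previous = current
    return next_id


# Hypothetical detections for three consecutive frames.
detections = [
    [(0, 0, 10, 10)],
    [(1, 0, 11, 10), (50, 50, 60, 60)],
    [(2, 0, 12, 10), (51, 50, 61, 60)],
]
total = count_unique_objects(detections)
```

Note that this sketch only compares against the immediately preceding frame, so an object that is missed for one frame would be counted again when it reappears; a real tracker would keep lost tracks alive for a few frames.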
If your drone is flying over and inspecting people, you could look at the video models trained on Kinetics for recognizing human actions:
ResNet 3D 18
ResNet MC 18
ResNet (2+1)D
These are all trained on Kinetics-400,
although the newer version of the dataset is Kinetics-700.

Try using torchvision and torch to recognize objects in a YouTube video.


Preparing image data to input into pre-built CNN

I am trying to create a CNN which can upon being fed input of images classify which part of the image to focus upon. For that purpose, I have collected data by obtaining gaze data of humans for a given video and divided each video frame into 9 different areas. With the actual gaze data acting as the supervisory data, I am trying to make my system learn how to mimic a human's eye gaze.
For starters, I am using a pre-built CNN for the classification of the MNIST dataset using tensorflow. I am currently trying to make my dataset follow the format of MNIST dataset keras.datasets.mnist. I have video frames in .jpg format and the corresponding grid area as a NumPy array.
I am stuck on how to correctly label and format my images so that I can feed them directly into the pre-built CNN. I am using TensorFlow 2.7.0 and Python 3.9.7 with conda.
Any help is much appreciated.
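For reference, keras.datasets.mnist.load_data() returns `(x_train, y_train), (x_test, y_test)`, where the images are uint8 arrays of shape (N, H, W) and the labels are uint8 arrays of shape (N,). A rough sketch of putting frames and grid labels into that layout is below. It assumes the frames have already been decoded into equally sized NumPy arrays and that the grid areas are encoded as integers 0 to 8; the function name `make_mnist_style_dataset`, the channel-averaging grayscale conversion, and the time-ordered split are all choices made for this example only.

```python
import numpy as np


def make_mnist_style_dataset(frames, labels, test_fraction=0.2):
    """Return ((x_train, y_train), (x_test, y_test)) in the MNIST layout."""
    x = np.stack([np.asarray(f) for f in frames])
    if x.ndim == 4:
        # (N, H, W, C) colour frames: average the channels to get grayscale.
        x = x.mean(axis=-1)
    x = x.astype(np.uint8)
    y = np.asarray(labels, dtype=np.uint8)
    if x.shape[0] != y.shape[0]:
        raise ValueError("frames and labels must have the same length")
    # Keep the last part of the video as the test set (no shuffling, so
    # near-identical neighbouring frames do not leak between the splits).
    n_test = int(round(x.shape[0] * test_fraction))
    split = x.shape[0] - n_test
    return (x[:split], y[:split]), (x[split:], y[split:])


# Stand-in data: ten 6x8 RGB "frames" and one grid-cell label per frame.
fake_frames = [np.full((6, 8, 3), i, dtype=np.uint8) for i in range(10)]
fake_labels = np.arange(10) % 9
(x_train, y_train), (x_test, y_test) = make_mnist_style_dataset(fake_frames, fake_labels)
```

If the pre-built network expects 28x28 inputs like MNIST, the frames would also need resizing before this step, or the model's input shape would need changing.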

Which Spectrogram best represents features of an audio file for CNN based model?

I am looking to understand various spectrograms for audio analysis. I want to convert an audio file into 10 second chunks, generate spectrograms for each and use a CNN model to train on top of those images to see if they are good or bad.
I have looked at linear, log, mel, etc. and read somewhere that a mel-based spectrogram is the best choice for this, but I found no properly verifiable information. I used the following simple code to generate a mel spectrogram.
import numpy as np
import librosa
import librosa.display
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
My question is: which spectrogram best represents the features of an audio file for training a CNN? I have used linear spectrograms, but for some audio files the linear spectrograms look the same.
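For the 10-second chunking step mentioned in the question, a minimal sketch could look like the one below. It assumes `y` is the 1-D sample array and `sr` the sample rate, as returned by librosa's loader; the function name `split_into_chunks` and the decision to drop a trailing partial chunk are assumptions for illustration.

```python
import numpy as np


def split_into_chunks(y, sr, chunk_seconds=10.0):
    """Split a 1-D signal into consecutive, equally long chunks.

    A trailing remainder shorter than chunk_seconds is dropped."""
    chunk_len = int(round(sr * chunk_seconds))
    n_full = len(y) // chunk_len
    return [y[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


# Stand-in signal: 25 seconds of silence at 22050 Hz.
sr = 22050
y = np.zeros(sr * 25, dtype=np.float32)
chunks = split_into_chunks(y, sr)
```

Each chunk can then be passed to the spectrogram function separately, so every image fed to the CNN covers the same duration.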
To add to what has been stated, I recommend reading through A Comparison of Audio Signal Preprocessing Methods for Deep Neural Networks on Music Tagging by Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler.
For their data, they achieved nearly identical classification accuracy between simple STFTs and mel spectrograms. So mel spectrograms seem to be the clear winner for dimension reduction if you don't mind the preprocessing. The authors also found, as jonner mentions, that log-scaling (essentially converting amplitude to a dB scale) improves accuracy. You can easily do this with Librosa (using your code) like this:
import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_db = librosa.power_to_db(S)
As for normalization after dB scaling, that seems hit or miss depending on your data. In the paper above, the authors found nearly no difference between the various normalization techniques on their data.
One last thing that should be mentioned is a somewhat new method called Per-Channel Energy Normalization (PCEN). I recommend reading Per-Channel Energy Normalization: Why and How by Vincent Lostanlen, Justin Salamon, Mark Cartwright, Brian McFee, Andrew Farnsworth, Steve Kelling, and Juan Pablo Bello. Unfortunately, there are some parameters that need adjusting depending on the data, but in many cases it seems to do as well as or better than log-mel spectrograms. You can implement it in Librosa like this:
import librosa
y, sr = librosa.load(r'C:\Users\Tej\Desktop\NoiseWork\NoiseOnly\song.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
S_pcen = librosa.pcen(S)
Although, like I mentioned, there are parameters within pcen that need adjusting! Here is Librosa's documentation on PCEN to get you started if you are interested.
Log-scaled mel spectrograms are the current "standard" for use with convolutional neural networks. They were the most commonly used representation in the Audio Event Detection and Audio Scene Classification literature between 2015 and 2018.
To be more invariant to amplitude changes, normalization is usually applied, either to entire clips or to the windows being classified. Mean/std normalization generally works fine.
But from the perspective of a CNN, there is relatively little difference between the different spectrogram variations. So switching representation is unlikely to fix your issue if two or more spectrograms are basically the same.
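The mean/std normalization mentioned above can be sketched in a few lines of NumPy. This assumes the input is a log-scaled (dB) spectrogram stored as a 2-D array; the function name `standardize` and the small `eps` guard are made up for this example. Because a gain change shows up as a constant offset on a dB scale, subtracting the mean removes it.

```python
import numpy as np


def standardize(spec_db, eps=1e-8):
    """Scale a (log-)spectrogram to zero mean and unit standard deviation."""
    spec_db = np.asarray(spec_db, dtype=np.float64)
    return (spec_db - spec_db.mean()) / (spec_db.std() + eps)


# Stand-in for a 128-band log-mel spectrogram with 431 frames.
rng = np.random.default_rng(0)
fake_spec_db = rng.normal(loc=-40.0, scale=12.0, size=(128, 431))

normalized = standardize(fake_spec_db)
# The same clip recorded 6 dB louder normalizes to (almost) the same values.
normalized_louder = standardize(fake_spec_db + 6.0)
```

Here the statistics are computed over the whole clip; computing them per analysis window instead is the other option mentioned above.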

How do feature maps in a Keras ConvNet represent features?

I know that it might be a dumb question, but I searched everywhere for an answer and could not find one.
First, let me explain my question properly.
When I was learning about CNNs, I was told that kernels (also called filters) or activation maps represent features of an image.
To be specific, for cat image identification, a feature map might represent "whiskers",
and in images where the activation of this feature map is high, it is inferred that whiskers are present and so the image is a cat. (Correct me if I am wrong.)
Now, when I built a Keras ConvNet, I saved the model,
then loaded the model and
saved all the filters as PNG images.
What I saw were 3x3 px images where each pixel was a different colour (green, blue, their various shades, and so on).
So how do these 3x3 px random colour patterns of kernels represent "whiskers" or any other feature of a cat?
Or how can I know which PNG image is which feature, i.e. which one is the whisker-detector filter, etc.?
I am asking this because I might be asked about it in an oral examination by my teacher.
Sorry for the length of the question (but I had to make it so to explain properly).
You need to take a further look at how convolutional neural networks operate, the main topic being the convolution itself. The convolution takes the input image and the filters/kernels and produces feature maps. A feature map is what may highlight important features.
The filters/kernels do not know anything about the input data, so when you save them you are only going to see pseudo-random images.
Put simply, where * is the convolution operator,
input_image * filter = feature_map
What you want to save, if you want to visualize what is occurring during convolution, are the feature maps. This website gives a very detailed account of how to do so, and it is the method I have used in the past.
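To make the relation input_image * filter = feature_map concrete, here is a small NumPy sketch. It is not how Keras computes it internally, just the same sliding-window operation written out by hand (without flipping the kernel, which is what CNN layers do in practice); the function name `conv2d_valid` and the toy image and kernel are invented for this example.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return
    the resulting feature map."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy 5x6 image: dark on the left, bright on the right.
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# A 3x3 kernel that responds to dark-to-bright transitions from left to right.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

feature_map = conv2d_valid(image, kernel)
```

The kernel on its own is just nine numbers, which is why the saved filter images look meaningless. The feature map, on the other hand, is large exactly where the image contains the edge and zero elsewhere, and that is the thing worth plotting.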

Keras data augmentation with change in outputs

I want to do regression with images. There are images of roads and the associated steering angle. As I want to apply data augmentation in Keras I would like to flip the input images horizontally but that would imply that the steering angle has to change its sign if the image is flipped. As far as I can see the documentation does not cover this problem. Is there a tutorial explaining how this can be achieved?
You have to write your own data-generator.
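As a rough idea of what such a generator could look like (this is a separate minimal sketch, not the ImageLoader class from the linked code), the version below assumes the images are already loaded into a NumPy array of shape (N, H, W, C) with one steering angle per image. The name `flip_augmented_batches` and its parameters are made up for this example. Each selected image is mirrored along its width axis and the sign of its angle is flipped at the same time.

```python
import numpy as np


def flip_augmented_batches(images, angles, batch_size=32, flip_prob=0.5, seed=None):
    """Yield (batch_images, batch_angles) forever, randomly mirroring images
    horizontally and negating the matching steering angles."""
    images = np.asarray(images)
    angles = np.asarray(angles, dtype=np.float32)
    rng = np.random.default_rng(seed)
    n = len(images)
    while True:
        order = rng.permutation(n)  # reshuffle at the start of every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            batch_x = images[idx].copy()
            batch_y = angles[idx].copy()
            flip = rng.random(len(idx)) < flip_prob
            # Reverse the width axis of the selected (H, W, C) images ...
            batch_x[flip] = batch_x[flip][:, :, ::-1]
            # ... and change the sign of their steering angles.
            batch_y[flip] = -batch_y[flip]
            yield batch_x, batch_y


# Stand-in data: four tiny 2x3 single-channel "road images".
demo_images = np.arange(4 * 2 * 3 * 1, dtype=np.float32).reshape(4, 2, 3, 1)
demo_angles = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
batches = flip_augmented_batches(demo_images, demo_angles, batch_size=4, seed=0)
batch_x, batch_y = next(batches)
```

A plain Python generator like this can usually be passed to model.fit together with steps_per_epoch, since it never stops yielding on its own.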
Check out the ImageLoader class (custom image generator) in my code here:

How can I load my own image into a network in Python?

I have made a convolutional neural network for MNIST data. Now I want to change the input to my own image. How can I do that? Do I need to save the picture in a specific format? In addition, how do I save all the pictures and train on them one after the other? I am using TensorFlow with Python.
TensorFlow has support for bmp, gif, jpeg and png out of the box.
So load the data (read the file into memory as a 0-D tensor of type string), then pass it to tf.image.decode_image, or to one of the specialized functions if that doesn't work for some reason.
You should get back the image as a tensor of shape [height, width, channels] (the channel dimension is 1 for a single-channel image such as grayscale, and GIFs come back with an extra leading frame dimension).
To make this work nicely, you should have all the images in the same format. If you can load all the images into RAM and pass them in bulk, go for it, since that's probably the easiest thing to do. The next easiest thing would be to copy the images into tf.train.Example records and use tf.TFRecordReader (tf.data.TFRecordDataset in TensorFlow 2) to do the shuffling and batching. If all else fails, I think you can set up the input functions to read the images on demand and pipe them through the batching mechanism, but I'm not sure how I would do that.
Here's a link to the tensorflow documentation related to images.
