How many images(minimum) should be there in each classes for training YOLO? - conv-neural-network

I am trying to implement YOLOv2 on my custom dataset. Is there any minimum number of images required for each class?

There is no minimum images per class for training. Of course the lower number you have, the model will converge slowly and the accuracy will be low.
What important, according to Alexey's (popular forked darknet and the creator of YOLO v4) how to improve object detection is :
For each object which you want to detect - there must be at least 1
similar object in the Training dataset with about the same: shape,
side of object, relative size, angle of rotation, tilt, illumination.
So desirable that your training dataset include images with objects at
diffrent: scales, rotations, lightings, from different sides, on
different backgrounds - you should preferably have 2000 different
images for each class or more, and you should train 2000*classes
iterations or more
https://github.com/AlexeyAB/darknet
So I think you should have minimum 2000 images per class if you want to get the optimum accuracy. But 1000 per class is not bad also. Even with hundreds of images per class you can still get decent (not optimum) result. Just collect as many images as you can.

It depends.
There is an objective minimum of one image per class. That may work with some accuracy, in principle, if using data-augmentation strategies and fine-tuning a pretrained YOLO network.
The objective reality, however, is that you may need as many as 1000 images per class, depending on your problem.

Related

Can't overcome Overfitting - GrayScale Images from Numerical Arrays and CNN with PyTorch

I am trying to implement an image classification task for the grayscale images, which were converted from some sensor readings. It means that I had initially time series data e.g. acceleration or displacement, then I transformed them into images. Before I do the transformation, I did apply normalization across the data. I have a 1000x9 image dimension where 1000 represents the total time step and 9 is the number of data points. The split ratio is 70%, 15%, and 15% for training, validation, and test data sets. There are 10 different labels, each label has 100 images, it's a multi-class classification task.
An example of my array before image conversion is:
As you see above, the precisions are so sensitive. When I convert them into images, I am able to see the darkness and white part of the image;
Imagine that I have a directory from D1 to D9 (damaged cases) and UN (health case) and there are so many images like this.
Then, I have a CNN-network where my goal is to make a classification. But, there is a significant overfitting issue and whatever I do it's not working out. One of the architecture I've been working on;
Model summary;
I also augment the data. After 250 epochs, this is what I get;
So, what I wonder is that I tried to apply some regularization or augmentation but they do not give me kind of solid results. I experimented it by changing the number of hidden units, layers, etc. Would you think that I need to fully change my architecture? I basically consider two blocks of CNN and FC layers at the end. This is not the first time I've been working on images like this, but I cannot mitigate this overfitting issue. I appreciate it if any of you give me some solid suggestions so I can get smooth results. i was thinking to use some pre-trained models for transfer learning but the image dimension causes some problems, do you know if I can use any of those pre-trained models with 1000x9 image dimension? I know there are some overfiting topics in the forum, but since those images are coming from numerical arrays and I could not make it work, I wanted to create a new title. Thank you!

Increasing instances of a class with Data Augmentation

I am working with some classes of the Charades Dataset https://prior.allenai.org/projects/charades to detect indoor actions.
The structure of my dataset is as follows:
Where:
c025, c137 and c142 are actions;
XR436 has frames result of splitting a video where users are performing action c025 and the same for X3803, ... There is a total of 250 folders.
RI495 has frames result of splitting a video where users are performing action c137 and the same for DI402, ... There is a total of 40 folders.
TUCK3 has frames result of splitting a video where users are performing action c142 and the same for the rest. There is a total of 260 folders.
As you can see, the instances of class c137 are quite unbalanced with regard to class c025 and c142. Thus, i would like to increase the number of instances of this class using data augmentation. The idea is creating twin folders with certain transformations. For example, creating A4DID folder as a twin of RI495 with Equalization over each of the frames, A4456 folder as a twin of RI495 in GrayScale, ARTI3 as a twin of DI402 with rotation over the frames, etc. The pattern of transformations can be the same for every folder or not. Just interesting in augmenting the number of instances.
Do you know how to proceed? I am using Pytorch and I tried with torchvision.transforms and DataLoader from torch.utils.data but I have not achieved the result that I am looking for. Any idea on how to proceed?
PS: Undersampling of c025 and c142 is not an option, due to the classifier is not able to learn well with such limited amount of examples.
Thank you in advance
A few thoughts:
Standard practice is to use transforms dynamically; that is, each time a data example is loaded, a compose or sequential set of transform operations are applied with random parameter settings. Thus, each time the datum is loaded, the resulting x (inputs) are different. This can be achieved by defining a stack of transforms to apply to each data example as it is loaded in a pytorch dataset object (see here). This helps provide data augmentation.
Class imbalance is a somewhat different issue, and is generally solved by either a.) oversampling (this is acceptable if using the above transform solution because the oversampled examples will have different transforms applied) or b.) over-weighting of these examples in the loss calculation. Of course, neither approach can account for the risk of receiving an out-of-distribution testing example which is higher the fewer and less diverse examples you have for a given class. The former can be acheived by defining a custom Sampler object that yields examples from your dataset in a class-balanced manner. The latter can be achieved by passing weights to the loss function (many pytorch loss functions such as CrossEntropyLoss already support weights).

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of different tasks are on different scales. Therefore, I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of some large-scale tasks will be extremely bigger. The figure below is part of my overall targets. You can certainly find that columns like ASA_m2_c have much higher values than some others.
First, I have already tried some weighted loss techniques to balance the concentration of my model when it does gradient backpropagation. The result shows it didn't perform well.
Secondly, I have seen tremendous discussions regarding normalizing the input data, but hardly discovered any particular talking about normalizing the labels. It's partly because most of the people's problems are classification type and a single task. I do know pytorch provides a convenient approach to normalize the vision dataset by transform.normalize, which is still operated on the input rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a Multi-Task Learning problem where one head has an output of ~500 and another between 0 and 1.
I've tried Uncertainty Weighting but in vain. So I'd be grateful if you could give me a little clue about your studies.(If there is any progress)
Thanks.

How to use a trained deeplearning model at different resolutions?

I have trained a model for image segmentation task on 320x240x3 resolution images using tensorflow 2.x. I am wondering if there is a way to use the same model or tweak the model to make it work on different resolutions?
I have to use a model trained on a 320x240 resolution for Full HD (1920x1080) and SD(1280x720) images but as the GPU Memory is not sufficient to train the model at the specified resolutions with my architecture, I have trained it on 320x240 images.
I am looking for a scalable solution that works at all the resolutions. Any Suggestions?
The answer to your question is no: you cannot use a model trained at a particular resolution to be used at different resolution; in essence, this is why we train the models at different resolutions, to check the performance and possibly improve it.
The suggestion below omits one crucial aspect: that, depending on the task at hand, increasing the resolution can considerably improve the results in object detection and image segmentation, particularly if you have small objects.
The only solution for your problem, considering the GPU memory constraint, is to try to split the initial image into smaller parts (or maybe tiles) and train per part(say 320x240) and then reconstruct the initial image; otherwise, there is no other solution than to increase the GPU memory in order to train at higher resolutions.
PS: I understood your question after reading it a couple of times; I suggest that you modify a little bit the details w.r.t the resolution.
YEAH, you can do it in high resolution image. But the small resolution is easy to train and it is easy for the model to find the features of the image. Training in small resolution models saves your time and makes your model faster since it has the less number of parameters. HD images contains large amount of pixels, so if you train your model in higher resolution images, it makes your training and model slower as it contains large number of parameters due to the presence of higher number of pixels and it makes difficult for your model to find features in the high resolution image. So, mostly your are advisable to use lower resolution instead of higher resolution.

How do you decide on the dimensions of the convolutional neural filter?

If I have an image which is WxHx3 (RGB), how do I decide how big to make the filter masks? Is it a function of the dimensions (W and H) or something else? How does the dimensions of the second, third, ... filters compare to the dimensions of the first filter? (Any concrete pointers would be appreciated.)
I have seen the following, but they don't answer the question.
Dimensions in convolutional neural network
Convolutional Neural Networks: How many pixels will be covered by each of the filters?
How do you decide the parameters of a Convolutional Neural Network for image classification?
It would be great if you add details what are you trying to extract from the image and details of the dataset that you are trying to use.
A general assumption can be drawn from Alexnet and ZFnet about the filter mask sizes that are needed to be considered. There is no specific formulation which size should be considered for particular format but the size is kept low if a deeper analysis is required as many smaller details might miss with larger filter sizes. In the above link with Inception networks describes how effectively you can utilize the computing resources. If you dont have the issue of the resources, then from ZFNet you can observe the visualizations in multiple layers, there are many finer details visible. We can call it CNN even if it has one layer of convolution and pooling layer. The number of layers depends on the deep finer requirements.
I am not expert, but can recommend if your dataset is small as few thousands and not many features extraction is required, and if you are not sure about the size you can just simply go with the small sizes (small best and popular is 5x5 - Lenet5).

Resources