CNN multi-class network - keras

What approach should I take when I want my CNN multi-class network to output something like [0.1, 0.1] when an image doesn't belong
to any class? Using softmax with categorical_crossentropy for multi-class classification gives me an output that sums to 1, so that is still not what I want.
I'm new to neural networks, so sorry for the silly question, and thanks in advance for any help.

I think you should look into Bayesian learning. First, let's talk about uncertainty.
For example, given several pictures of dog breeds as training data—when a user uploads a photo of his dog—the hypothetical website should return a prediction with rather high confidence. But what should happen if a user uploads a photo of a cat and asks the website to decide on a dog breed?
The above is an example of out of distribution test data. The model has been trained on photos of dogs of different breeds, and has (hopefully) learnt to distinguish between them well. But the model has never seen a cat before, and a photo of a cat would lie outside of the data distribution the model was trained on. This illustrative example can be extended to more serious settings, such as MRI scans with structures a diagnostics system has never observed before, or scenes an autonomous car steering system has never been trained on.
A possible desired behaviour of a model in such cases would be to return a prediction (attempting to extrapolate far away from our observed data), but return an answer with the added information that the point lies outside of the data distribution. We want our model to possess some quantity conveying a high level of uncertainty with such inputs (alternatively, conveying low confidence).
Then I think you could briefly read this paper, where the authors also apply the approach to a classification task and produce uncertainty estimates for the classes (dog, cat, ...). From there, you can extend the idea to your own application using this paper, and I think you will find what you want.
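If you want something concrete to try in Keras, one popular approximation of this Bayesian idea is Monte Carlo dropout: keep dropout active at prediction time and look at the spread of several stochastic forward passes. The sketch below is only an illustration (the architecture, input shape, and number of classes are placeholders, and this may not be exactly the method used in the papers above), but on out-of-distribution inputs the spread tends to be larger:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical small CNN; the Dropout layers are what make MC dropout possible.
def build_model(num_classes=3):
    return keras.Sequential([
        keras.Input(shape=(64, 64, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(x_train, y_train, ...)  # train as usual

def mc_predict(model, x, n_samples=30):
    """Run several stochastic forward passes with dropout kept active
    (training=True) and return the mean prediction and its std."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_samples)])
    return preds.mean(axis=0), preds.std(axis=0)

# A high std (or high entropy of the mean) suggests the input may be out of
# distribution, even though each individual softmax output still sums to 1.
```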

Related

What does "fine-tuning of a BERT model" refer to?

I was not able to understand one thing: when people say "fine-tuning of BERT", what does it actually mean?
1. Are we retraining the entire model again with new data?
2. Or are we just training the top few transformer layers with new data?
3. Or are we training the entire model, but taking the pre-trained weights as the initial weights?
4. Or is there already a small ANN (a few layers) on top of the transformer layers, which alone gets trained while the transformer weights stay frozen?
I tried Google but I am still confused; it would be great if someone could help me with this.
Thanks in advance!
I remember reading about a Twitter poll with similar context, and it seems that most people tend to accept your suggestion 3. (or variants thereof) as the standard definition.
However, this obviously does not speak for every single work, but I think it's fairly safe to say that 1. is usually not included when talking about fine-tuning. Unless you have vast amounts of (labeled) task-specific data, this step would be referred to as pre-training a model.
2. and 4. could be considered fine-tuning as well, but from personal/anecdotal experience, allowing all parameters to change during fine-tuning has provided significantly better results. Depending on your use case, this is also fairly simple to experiment with, since freezing layers is trivial in libraries such as Huggingface transformers.
In either case, I would really consider them as variants of 3., since you're implicitly assuming that we start from pre-trained weights in these scenarios (correct me if I'm wrong).
Therefore, trying my best at a concise definition would be:
Fine-tuning refers to the step of training any number of parameters/layers with task-specific and labeled data, from a previous model checkpoint that has generally been trained on large amounts of text data with unsupervised MLM (masked language modeling).
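For what it's worth, here is a minimal sketch of variants 3 and 4 using the Huggingface transformers library mentioned above; the checkpoint name and number of labels are just placeholders:

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical setup: a BERT checkpoint fine-tuned for binary classification.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Variant 3: train everything, starting from the pre-trained weights.
# This is the default; all parameters already have requires_grad=True.

# Variant 4: freeze the transformer and train only the classification head.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
# Either variant is then trained with the usual Trainer / training loop
# on the task-specific labeled data.
```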

When and Whether should we normalize the ground-truth labels in the multi-task regression models?

I am trying a multi-task regression model. However, the ground-truth labels of the different tasks are on different scales, so I wonder whether it is necessary to normalize the targets. Otherwise, the MSE of the large-scale tasks will be far larger than that of the others. The figure below shows part of my targets. You can see that columns like ASA_m2_c have much higher values than some of the others.
First, I have already tried some weighted-loss techniques to balance how much each task contributes when the model does gradient backpropagation, but the results show it didn't perform well.
Secondly, I have seen plenty of discussion about normalizing the input data, but hardly any about normalizing the labels, partly because most people's problems are single-task classification. I do know PyTorch provides a convenient way to normalize vision datasets via transforms.Normalize, but that again operates on the inputs rather than the labels.
Similar questions: https://forums.fast.ai/t/normalizing-your-dataset/49799
https://discuss.pytorch.org/t/ground-truth-label-normalization/26981/19
PyTorch - How should you normalize individual instances
Moreover, I think it might be helpful to provide some details of my model architecture. The input is first fed into a feature extractor and then several generators use the shared output representation from that extractor to predict different targets.
I've been working on a multi-task learning problem where one head outputs values around 500 and another between 0 and 1.
I've tried uncertainty weighting, but in vain. So I'd be grateful if you could give me a little clue about your findings (if there has been any progress).
Thanks.
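One common remedy, sketched below with made-up shapes and scales, is to standardize each target column separately using training-set statistics, train on the scaled targets, and invert the transform when reporting predictions so every task's MSE is on a comparable scale:

```python
import numpy as np

# Made-up targets: each column is one task on a very different scale
# (e.g. a column like ASA_m2_c vs. a target bounded between 0 and 1).
y_train = np.random.rand(1000, 3) * np.array([500.0, 1.0, 30.0])

# Per-task standardization, fitted on the training targets only.
y_mean = y_train.mean(axis=0)
y_std = y_train.std(axis=0) + 1e-8  # avoid division by zero

y_train_scaled = (y_train - y_mean) / y_std
# model.fit(x_train, y_train_scaled)  # train on the scaled targets

def unscale(pred_scaled):
    """Map predictions back to the original units before computing metrics."""
    return pred_scaled * y_std + y_mean
```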

Changes in GPT2/GPT3 model during few shot learning

During transfer learning, we take a pre-trained network and some observation pairs (inputs and labels), and use these data to fine-tune the weights via backpropagation. However, during one-shot/few-shot learning, according to the paper 'Language Models are Few-Shot Learners' (https://arxiv.org/pdf/2005.14165.pdf), "no gradient updates are performed". Then what changes happen to models like GPT-2 and GPT-3 during one-shot/few-shot learning?
Then what changes happen to models like GPT-2 and GPT-3 during one-shot/few-shot learning?
There is no change to the model at all. The model does not permanently learn anything. What they do is give the "training examples" as context to the model, and the model generates an output at the end of this context. Figure 2.1 (Brown, Tom B., et al., "Language Models are Few-Shot Learners", 2020) shows example inputs for fine-tuning, zero-shot learning, and few-shot learning.
As you can see, the training examples are part of the input and must be supplied every time a prediction is to be made. Therefore no change happens to the model.
Figure 2.1 from Brown, Tom B., et al., "Language Models are Few-Shot Learners" (2020).
You may think that something changed because the model returns better results in the few-shot case. However, it is the same model, just with a different context as input. GPT-2 and GPT-3 are both auto-regressive models, meaning that the output depends on the context.
More examples mean a clearer context, and thus the chance of obtaining the desired result increases.
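To make this concrete, here is a small sketch using the Huggingface transformers library (an assumption on my part; the paper uses OpenAI's own models, and plain GPT-2 is far weaker at this than GPT-3, but the mechanics are the same). The example lines mirror the translation prompt from Figure 2.1:

```python
from transformers import pipeline

# The few-shot "training examples" are simply part of the input context;
# the model's weights are never updated.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

generator = pipeline("text-generation", model="gpt2")
output = generator(prompt, max_new_tokens=5, do_sample=False)
print(output[0]["generated_text"])
# Removing or changing the example lines changes the behaviour of the very
# same model, which is the whole point: the "learning" lives in the context.
```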

Retraining Tensorflow for new class labels

I am building a classifier that predicts the damage level of a vehicle (high, low, medium, good). I referred to this GitHub repository:
https://github.com/raviranjan0309/Car-Damage-Detector
There is a retrained_label.txt file in models/tf_files which consists of four classes:
not,
car,
high,
low
I do not want these four classes; I want my TensorFlow model to predict one of the following:
Good,
High Damage,
Low Damage,
Medium Damage
Is this possible?
Do I need to retrain the model for these classes?
If so, how?
Thanks
The file you mentioned only has 4 words in it and, to be honest, it is difficult to understand why they are there.
Normally, for any TensorFlow-related analysis, you have to retrain the algorithm so it can predict based on the new labels.
If you are new to ML/DL and TensorFlow, I would suggest looking at the excellent tutorials on Titanic survival predictors, where you use a simple dataset to predict one of two outcomes: survived or died (there are many examples, but here is one: https://towardsdatascience.com/predicting-the-survival-of-titanic-passengers-30870ccc7e8).
You can then use similar code with a different dataset (in this case, I guess, a car dataset) to have it predict one of four possible damage outcomes. The only problem, of course, is getting that dataset.
Without at least a thousand or so data points with car information where the damage is already labelled, it would be quite challenging.
So just to summarize:
1) Yes, you have to retrain, and you probably need a different dataset too.
2) You may be able to create a dataset with damage info based on what you already have.
3) Once the training/testing sets are ready, you can retrain using simple ML techniques.
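As a rough illustration of step 3, here is a minimal transfer-learning sketch in Keras; the directory layout, class names, and MobileNetV2 backbone are assumptions of mine, not what the linked repository actually uses:

```python
from tensorflow import keras

CLASSES = ["good", "high_damage", "low_damage", "medium_damage"]  # the new labels

# Hypothetical folder layout: one sub-directory per class under data/train.
train_ds = keras.utils.image_dataset_from_directory(
    "data/train", image_size=(224, 224), batch_size=32, label_mode="int"
)

# Reuse a pre-trained backbone and train only a new classification head.
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained feature extractor

model = keras.Sequential([
    keras.layers.Rescaling(1.0 / 127.5, offset=-1.0),  # MobileNetV2 expects [-1, 1]
    base,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(len(CLASSES), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=10)
```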

Where do the input filters come from in conv-neural nets (MNIST Example)

I am a newbie to convolutional neural nets... so this may be an ignorant question.
I have followed many examples and tutorials on the MNIST example in TensorFlow. In the CNN examples, all the authors talk about using the 'input filters' to run in the CNN, but no one that I can find mentions WHERE these come from. Can anyone tell me where they come from? Or are they magically obtained from the input images?
Thanks! Chris
This is an image that one professor uses, but he does not explain whether he made the filters himself or whether TensorFlow somehow auto-extracts them.
Disclaimer: I am not an expert, more of an enthusiast.
To cut a long story short: filters are the CNN equivalent of weights, and all a neural network essentially does is learn their optimal values.
It does this by iterating through a training dataset, making predictions, comparing them to the label/value already assigned to each training example (usually an image, in the case of a CNN), and adjusting the weights to minimize the error function (the difference between the predicted value and the actual value).
Initial values of filters/weights do not matter that much, so although they might affect the speed of convergence to a small degree, I believe they are often assigned random values.
It is the job of the neural network to figure out the optimal weights, not of the person implementing it.
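A short Keras example may make this concrete (the 28x28x1 MNIST-style input shape is just an assumption):

```python
from tensorflow import keras

# A convolutional layer's "filters" are just trainable weights, exactly like
# the weights of a dense layer.
layer = keras.layers.Conv2D(filters=8, kernel_size=3)
layer.build(input_shape=(None, 28, 28, 1))  # hypothetical MNIST-style input

kernels, biases = layer.get_weights()
print(kernels.shape)  # (3, 3, 1, 8): eight 3x3 filters over one input channel

# Nobody supplies these values by hand: they start out as random numbers
# (Glorot uniform initialization by default in Keras) and are adjusted by
# backpropagation during model.fit() to minimize the loss.
```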

Resources