I want to train a model that performs few-shot image classification using CIFAR-10. So I have to train the model on a small number of classes and use the rest of the classes for testing. I'm wondering, if I have only 10 classes, how can I do the split? (For example, 6 classes for training and 4 for testing: is that OK?)
I have not completely understood your question. To put it crudely, we have three options:
1. Training a model from scratch
2. Fine-tuning a model
3. Using a pre-trained model for few-shot learning
If number 3 is what you are looking for, I would recommend that you look into models like CLIP and see whether they work for your use case.
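For the class split raised in the question, a common few-shot setup keeps the training and test classes disjoint and samples small N-way K-shot episodes from each side. The sketch below is only an illustration under those assumptions: the 6/4 split, the helper names `split_classes` and `sample_episode`, and the toy index table are hypothetical and not taken from any library.

```python
import random

CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

def split_classes(classes, n_train, seed=0):
    """Shuffle the class names and return disjoint (train, test) class lists."""
    rng = random.Random(seed)
    shuffled = list(classes)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def sample_episode(indices_by_class, classes, n_way, k_shot, n_query, rng):
    """Sample one N-way K-shot episode as (support, query) lists of (index, class)."""
    episode_classes = rng.sample(classes, n_way)
    support, query = [], []
    for cls in episode_classes:
        picked = rng.sample(indices_by_class[cls], k_shot + n_query)
        support += [(i, cls) for i in picked[:k_shot]]
        query += [(i, cls) for i in picked[k_shot:]]
    return support, query

train_classes, test_classes = split_classes(CIFAR10_CLASSES, n_train=6)

# Toy stand-in for the real dataset: 20 image indices per class (assumption).
indices_by_class = {
    cls: list(range(c * 20, (c + 1) * 20)) for c, cls in enumerate(CIFAR10_CLASSES)
}

rng = random.Random(1)
support, query = sample_episode(
    indices_by_class, test_classes, n_way=4, k_shot=5, n_query=3, rng=rng
)
```

With only 4 held-out classes, every test episode is at most 4-way, which is worth keeping in mind when comparing against benchmarks that use more test classes.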
So if I understand correctly there are mainly two ways to adapt BERT to a specific task: fine-tuning (all weights are changed, even pretrained ones) and feature-based (pretrained weights are frozen). However, I am confused.
When to use which one? If you have unlabeled data (unsupervised learning), should you then use fine-tuning?
If I want to fine-tune BERT, isn't the only option to do that using the masked language model and next sentence prediction objectives? And also: is it necessary to put another neural network layer on top?
Thank you.
Your first approach should be to try the pre-trained weights. Generally this works well. However, if you are working in a different domain (e.g. medicine), then you'll need to fine-tune on data from the new domain. Again, you might be able to find models already pre-trained on that domain (e.g. BioBERT).
For adding a layer, there are slightly different approaches depending on your task. E.g. for question answering, have a look at the TANDA paper (Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection). It is a very nice, easily readable paper which explains the transfer and adaptation strategy. Again, Hugging Face has modified and pre-trained models for most of the standard tasks.
I have a query regarding the extraction of VGG16/VGG19 features for my experiments.
The pre-trained VGG16 and VGG19 models have been trained on the ImageNet dataset, which has 1000 classes (say c1, c2, ... c1000), and normally we extract the features from the first and second fully connected layers (designated 'FC1' and 'FC2'); these 4096-dimensional feature vectors are then used for computer vision tasks.
My question is: can we use these networks to extract features of an image that does not belong to any of the above 1000 classes? In other words, can we use these networks to extract features of an image with label c1001? Remember that c1001 does not belong to the ImageNet classes on which these networks were initially trained.
In the article available on https://www.pyimagesearch.com/2019/05/20/transfer-learning-with-keras-and-deep-learning/, I am quoting the following -
When performing feature extraction, we treat the pre-trained network
as an arbitrary feature extractor, allowing the input image to
propagate forward, stopping at pre-specified layer, and taking the
outputs of that layer as our features
From the above text, there is no restriction on whether the image must belong to one of the ImageNet classes.
Kindly spare some time to uncover this mystery.
In the research papers, the authors simply state that they have used features extracted from VGG16/VGG19 network pre-trained on Imagenet dataset without giving any further details.
I am giving a case study for reference:
The Animals with Attributes dataset (see https://cvml.ist.ac.at/AwA2/) is a very popular dataset with 50 animal classes for image recognition tasks. The authors have extracted ILSVRC-pretrained ResNet101 features for the images in this dataset. This ResNet101 network has been pre-trained on the 1000 ImageNet classes (the list of ImageNet classes is available at https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a#file-imagenet1000_clsidx_to_labels-txt).
Also, the AWA classes are put as follows:
antelope, grizzly+bear, killer+whale, beaver, dalmatian, persian+cat, horse
german+shepherd, blue+whale, siamese+cat, skunk, mole, tiger, hippopotamus, leopard, moose, spider+monkey, humpback+whale, elephant, gorilla, ox, fox, sheep
seal, chimpanzee, hamster, squirrel, rhinoceros, rabbit, bat, giraffe, wolf, chihuahua, rat, weasel, otter, buffalo, zebra, giant+panda, deer, bobcat, pig, lion, mouse, polar+bear, collie, walrus, raccoon, cow, dolphin
Now, if we compare the classes in the dataset with the 1000 ImageNet classes, we find that classes like dolphin, cow, raccoon, bobcat, bat, seal, sheep, horse, grizzly bear, giraffe, etc. are not in ImageNet, and still the authors went ahead with extracting ResNet101 features. I believe that the extracted features are generalizable and that is why the authors consider them meaningful representations for the AwA images.
Your take on this?
The idea is to get the representations for the images not belonging to ImageNet classes and use them along with their labels in some other classifier.
Yes, you can, but.
Features in the first fully-connected layers are supposed to encode very general patterns, like angles, lines, and simple shapes. You can assume those generalize outside the class set the network was trained on.
There is one "but", however: those features were learned so as to minimize the error on that particular classification task with 1000 classes. This means there is no guarantee that they are helpful for classifying an arbitrary class.
For only extracting the features, you can input any image you want in your pretrained VGG/other CNN. However, for the purpose of training, you have to implement other steps as stated below.
The features that are extracted have been determined by means of exclusively training on those 1000 classes and belong to those 1000 classes. You can use your network to predict on images that do not belong to those 1000 classes, but in the paragraphs below I explain why this is not the desired approach.
The key point here is that the set of extracted features can be used to detect the presence of other objects in a photo, but not "ready"/"out of the box".
For example, edges and lines are features that are not related exclusively to those 1000 classes, but also to other ones, hence they are useful, general features.
Therefore, you can employ "transfer learning", to train on your own images (dataset), for example c1001, c1002, c1003.
Notice, however, that you need to train on your own set before you can use the network to predict on your new images (new classes). Transfer learning refers to reusing the set of already learned features, which can be suitable for another problem, but you still need to train on your "new problem", say c1001, c1002, c1003.
For image classification, you may need to fine-tune the model on images of the relevant classes so that it can predict the c1001 class label.
But if you are planning to use it for unsupervised learning, with the network serving only as a feature extractor, then there is no need to retrain the model. You can use the existing pre-trained ImageNet weights and extract features with them, since VGG16/19 learns general lower-level features in its initial layers and only the last few layers are used for classification.
So basically, a pretrained model can be used for unsupervised learning and feature extraction without retraining.
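The "extract features, then train some other classifier on them" idea discussed above can be sketched as follows. This is a toy illustration, not the VGG pipeline itself: `frozen_extractor` is a hypothetical stand-in for the FC-layer output of a pretrained network, and a nearest-centroid rule is assumed here as the "other classifier".

```python
import math

def frozen_extractor(image):
    """Hypothetical stand-in for a frozen pretrained network.

    Maps a raw input (a list of floats) to a fixed feature vector. It is
    never updated while the new classifier is fitted.
    """
    mean = sum(image) / len(image)
    spread = max(image) - min(image)
    return [mean, spread]

def fit_centroids(features, labels):
    """Average the feature vectors of each class (the only training step)."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in Euclidean distance."""
    def distance(label):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(centroids[label], vec)))
    return min(centroids, key=distance)

# Toy "images" from two new classes that the extractor never saw during training.
images = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.9, 0.1, 0.8], [0.8, 0.0, 0.9]]
labels = ["c1001", "c1001", "c1002", "c1002"]

features = [frozen_extractor(img) for img in images]
centroids = fit_centroids(features, labels)
prediction = predict(centroids, frozen_extractor([0.85, 0.05, 0.9]))
```

In a real experiment the extractor would be the pretrained CNN with its classification layer removed, and the classifier could be anything trained on the extracted vectors, for example an SVM or logistic regression.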
ResNet50 is cool when we need to classify different objects, say trees, dogs, tampons etc. But what if we want to further classify, say, types of trees, or ice creams (cone, candy stick, cup) using ResNet50? Is there a way this would work? PyTorch answers are also welcome.
Yes it is possible.
ResNet50 is just the architecture of an artificial neural network. What it can classify depends on the training data you feed it, or on the data it was trained on if you use pretrained weights.
If you want to classify types of trees, you would need to create (or find) a data set that shows different types of trees with the appropriate labels. Then you can train on the different tree types.
I suggest that you go through some tutorials, as explaining the whole process of data collection, data preprocessing, data annotation, and training an ANN or classic machine learning models would be a bit much here.
Best of luck
I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1000 or more labeled samples.
Pretraining, on the other hand, tries to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
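To make the masking objective concrete, here is a simplified sketch of how masked-language-model training pairs can be built from unlabeled text. It works on whitespace tokens rather than a real tokenizer, and it always substitutes `[MASK]`, whereas BERT's actual recipe also swaps some selected tokens for random ones or leaves them unchanged. The function name `mask_tokens` and the example sentence are made up for illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token where position i was masked and None
    elsewhere, so a loss would be computed on the masked positions only.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position ignored by the loss
    return masked, labels

tokens = "the patient was given 50 mg of aspirin after the surgery".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=42)
```

No labels are needed beyond the text itself: the targets are simply the tokens that were hidden, which is why plain domain text files are enough for this kind of training.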
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation" and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you had better use Sentence Transformers. Moreover, they provide fine-tuned models which are already capable of understanding semantic similarity between sentences. The "Continue Training on Other Data" section is what you want in order to further fine-tune the model on your domain. You do have to prepare a training dataset that matches one of the available loss functions. E.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
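To show why a contrastive loss needs a pair of texts plus a similarity label, here is a hand-written sketch of the standard contrastive loss on two embedding vectors. It is not the library's implementation: the cosine distance, the margin of 0.5, and the toy two-dimensional embeddings are assumptions chosen for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity, so identical directions give distance 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrastive_loss(u, v, label, margin=0.5):
    """label = 1 for a similar pair, 0 for a dissimilar pair."""
    d = cosine_distance(u, v)
    similar_term = label * d ** 2                              # pull similar pairs together
    dissimilar_term = (1 - label) * max(0.0, margin - d) ** 2  # push dissimilar pairs apart
    return 0.5 * (similar_term + dissimilar_term)

emb_a = [1.0, 0.0]
emb_b = [1.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 1.0]  # orthogonal to emb_a

loss_similar_close = contrastive_loss(emb_a, emb_b, label=1)
loss_similar_far = contrastive_loss(emb_a, emb_c, label=1)
loss_dissimilar_far = contrastive_loss(emb_a, emb_c, label=0)
loss_dissimilar_close = contrastive_loss(emb_a, emb_b, label=0)
```

The loss is zero when a similar pair already coincides or a dissimilar pair is already further apart than the margin, and positive in the two "wrong" cases, which is what drives the embeddings during training.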
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.
BERT pre-training of the base model is done with a language modeling approach, where we mask a certain percentage of the tokens in a sentence and make the model learn to predict those masked tokens. Then, I think, in order to do downstream tasks, we add a newly initialized layer and fine-tune the model.
However, suppose we have a gigantic dataset for sentence classification. Theoretically, can we initialize the BERT base architecture from scratch, train both the additional downstream task-specific layer and the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
Thanks.
BERT can be viewed as a language encoder, which is trained on a humongous amount of data to learn the language well. As we know, the original BERT model was trained on the entire English Wikipedia and BooksCorpus, which together total about 3,300M words. BERT-base has 109M model parameters. So, if you think you have enough data to train BERT, then the answer to your question is yes.
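The 109M figure can be checked with a little arithmetic. The sketch below counts the weights of the BERT-base encoder assuming the uncased configuration (vocabulary of 30,522 tokens, hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types), including the pooler and excluding any task-specific head. The function name `bert_param_count` is made up for this illustration.

```python
def bert_param_count(vocab_size=30522, hidden=768, layers=12,
                     intermediate=3072, max_positions=512, type_vocab=2):
    """Count the trainable parameters of a BERT-style encoder with a pooler."""
    layer_norm = 2 * hidden  # scale and bias

    # Token, position and segment embeddings, followed by a layer norm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> intermediate -> hidden, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm

    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

total = bert_param_count()
print(f"{total:,}")  # 109,482,240
```

Roughly 85M of these parameters sit in the 12 transformer layers and about 24M in the embeddings, so shrinking the hidden size or the number of layers is what mostly reduces the model.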
However, when you said "still achieve a good result", I assume you are comparing against the original BERT model. In that case, the answer lies in the size of the training data.
I am wondering why you prefer to train BERT from scratch instead of fine-tuning it. Is it because you are afraid of the domain adaptation issue? If not, pre-trained BERT is perhaps a better starting point.
Please note, if you want to train BERT from scratch, you may consider a smaller architecture. You may find the following papers useful.
Well-Read Students Learn Better: On the Importance of Pre-training Compact Models
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
I can give help.
First of all, MLM and NSP (which are the original pre-training objectives from NAACL 2019) are meant to train language encoders with prior language knowledge. The result is like a primary school student who has read many books in the general domain. Before BERT, many neural networks would be trained from scratch, from a clean slate where the model doesn't know anything. This is like a newborn baby.
So my question is, "is it a good idea to start teaching a newborn baby when you can begin with a primary school student?" My answer is no. This is supported by the numerous state-of-the-art results achieved by pre-trained models, compared to the old method of training a neural network from scratch.
As someone who works in the field, I can assure you that it is a much better idea to fine-tune a pre-trained model. It doesn't matter whether you have a dataset of 200k or 1 million data points. In fact, more fine-tuning data will only make the downstream results better if you use the right hyperparameters.
Though I recommend a learning rate between 2e-6 and 5e-5 for sentence classification tasks, you can explore. If your dataset is very, very domain-specific, it's up to you to fine-tune with a higher learning rate, which will move the model further away from its "pre-trained" knowledge.
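If you do explore, one simple option is to try a handful of learning rates spread evenly on a log scale across that range. This is just a sketch of how such a grid could be generated. The helper name `log_spaced` and the choice of five candidates are assumptions, not a recommendation from any library.

```python
def log_spaced(low, high, n):
    """Return n values from low to high, evenly spaced on a log scale."""
    ratio = high / low
    return [low * ratio ** (i / (n - 1)) for i in range(n)]

# Five candidate learning rates between 2e-6 and 5e-5.
candidate_lrs = log_spaced(2e-6, 5e-5, n=5)
for lr in candidate_lrs:
    print(f"{lr:.2e}")
```

Each candidate would then be used for a separate fine-tuning run, keeping whichever gives the best validation score.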
And also, regarding your question on
can we initialize the BERT base architecture from scratch, train both the additional downstream task specific layer + the base model weights from scratch with this sentence classification dataset only, and still achieve a good result?
I'm negative about this idea. Even if you have a dataset with 200k instances, BERT was pre-trained on 3,300M words. BERT is too inefficient to be trained from scratch with 200k instances (both size-wise and architecture-wise). If you want to train a neural network from scratch, I'd recommend you look into LSTMs or RNNs.
I'm not saying I recommend LSTMs. Just fine-tune BERT. 200k is not even that big anyway.
All the best luck with your NLP studies :)