Recently,i have read about the "discriminative reranking for natural language processing" by Collins.
I'm confused what does the reranking actually do?
Add more global features to the rerank model? or something else?
If you mean this paper, then what is done is the following:
train a parser using a generative model, i.e. one where you compute P(term | tree) and use Bayes' rule to reverse that and get P(tree | term),
apply that to get an initial k-best ranking of trees from the model,
train a second model on features of the desired trees,
apply that to re-rank the output from 2.
The reason why the second model is useful is that in generative models (such as naïve Bayes, HMMs, PCFGs), it can be hard to add features other than word identity, because the model would try to predict the probability of the exact feature vector instead of the separate features, which might not have occurred in the training data and will have P(vector|tree) = 0 and therefore P(tree|vector) = 0 (+ smoothing, but the problem remains). This is the eternal NLP problem of data sparsity: you can't build a training corpus that contains every single utterance that you'll want to handle.
Discriminative models such as MaxEnt are much better at handling feature vectors, but take longer to fit and can be more complicated to handle (although CRFs and neural nets have been used to construct parsers as discriminative models). Collins et al. try to find a middle ground between the fully generative and fully discriminative approaches.
Related
Let's imagine we generated a 200 dimension word vector using any pre-trained model of the word ('hello') as shown in the below image.
So, by any means can we tell which linguistic feature is represented by each d_i of this vector?
For example, d1 might be looking at whether the word is a noun; d2 might tell whether the word is a named entity or not and so on.
Because these word vectors are dense distributional representations, it is often difficult / impossible to interpret individual neurons, and such models often do not localize interpretable features to a single neuron (though this is an active area of research). For example, see Analyzing Individual Neurons in Pre-trained Language Models
for a discussion of this with respect to pre-trained language models).
A common method for studying how individual dimensions contribute to a particular phenomenon / task of interest is to train a linear model (i.e., logistic regression if the task is classification) to perform the task from fixed vectors, and then analyze the weights of the trained linear model.
For example, if you're interested in part of speech, you can train a linear model to map from the word vector to the POS [1]. Then, the weights of the linear model represent a linear combination of the dimensions that are predictive of the feature. For example, if the weight on the 5th neuron has large magnitude (very positive or very negative), you might expect that neuron to be somewhat correlated with the phenomenon of interest.
[1]: Note that defining a POS for a particular word is nontrivial, since the POS often depends on context. For example, "play" can be a noun ("he saw a play") or a verb ("I will play in the grass").
I have a query regarding the extraction of VGG16/VGG19 features for my experiments.
The pre-trained VGG16 and VGG19 models have been trained on ImageNet dataset having 1000 classes (say c1,c2, ... c1000) and normally we extract the features from first and second fully connected layers designated ('FC1' and 'FC2'); these 4096 dimensional feature vectors are then used for computer vision tasks.
My question is that can we use these networks to extract features of an image that does not belong to any of the above 1000 classes ? In other words, can we use these networks to extract features of an image with label c1001 ? Remember that c1001 does not belong to the Imagenet classes on which these networks were initially trained on.
In the article available on https://www.pyimagesearch.com/2019/05/20/transfer-learning-with-keras-and-deep-learning/, I am quoting the following -
When performing feature extraction, we treat the pre-trained network
as an arbitrary feature extractor, allowing the input image to
propagate forward, stopping at pre-specified layer, and taking the
outputs of that layer as our features
From the above text, there is no restriction to whether the image must necessarily belong to one of the Imagenet classes.
Kindly spare some time to uncover this mystery.
In the research papers, the authors simply state that they have used features extracted from VGG16/VGG19 network pre-trained on Imagenet dataset without giving any further details.
I am giving a case study for reference:
Animal with Attribute dataset (see https://cvml.ist.ac.at/AwA2/) is a very popular dataset with 50 animal classes for image recognition task. The authors have extracted ILSVRC-pretrained ResNet101 features for the above dataset images. This ResNet 101 network has been pre-trained on 1000 imagenet classes (different imagenet classes are available at https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a#file-imagenet1000_clsidx_to_labels-txt).
Also, the AWA classes are put as follows:
antelope, grizzly+bear, killer+whale, beaver, dalmatian, persian+cat, horse
german+shepherd, blue+whale, siamese+cat, skunk, mole, tiger, hippopotamus, leopard, moose, spider+monkey, humpback+whale, elephant, gorilla, ox, fox, sheep
seal, chimpanzee, hamster, squirrel, rhinoceros, rabbit, bat, giraffe, wolf, chihuahua, rat, weasel, otter, buffalo, zebra, giant+panda, deer, bobcat, pig, lion, mouse, polar+bear, collie, walrus, raccoon, cow, dolphin
Now, if we compare the classes in the dataset with 1000 Imagenet classes, we find that classes like dolphin, cow, racoon, bobcat, bat, seal, sheep, horse, grizzly bear, giraffe etc are not there in the Imagenet and still the authors went on with extracting ResNet101 features. I believe that the features extracted are generalizable and that is why authors consider these features as meaningful representations for the AWA images.
Your take on this ?
The idea is to get the representations for the images not belonging to ImageNet classes and use them along with their labels in some other classifier.
Yes, you can, but.
Features in first fully-connected layers suppose to encode very general patterns, like angles, lines, and simple shapes. You can assume those can be generalized outside the class set it was trained on.
There is one But, however - those features were found as to minimize error on that particular classification task with 1000 classes. It means, that there can be no guarantee that they are helpful for classifying arbitrary class.
For only extracting the features, you can input any image you want in your pretrained VGG/other CNN. However, for the purpose of training, you have to implement other steps as stated below.
The features that are extracted have been determined by means of exclusively training on those 1000 classes and belong to those 1000 classes. You can use your network to predict on images that do not belong to those 1000 classes, but in the paragraphs below I explain why this is not the desired approach.
The key point to outline here is that, the set features that were extracted can be used to detect/determine the presence of other objects within a photo, but not "ready"/"out of the box".
For example, edges and lines are features that are not related exclusively to those 1000 classes, but also to other ones, hence they are useful, general features.
Therefore, you can employ "transfer learning", to train on your own images (dataset), for example c1001, c1002, c1003.
Notice however that you need to train on your own set before you can use the network to predict on your new images(new classes). Transfer learning refers to using the set of already gathered/learned features, which can be suitable to apply on another problem, but you need to train on your "new problem", say c1001, c1002, c1003.
For Image classification you may need to fine tune the model using relevant classes for c1001 class label.
But if you are planning to use it for unsupervised learning and using it for feature extraction part only, then there is no need to retrain the model. You can use existing pre-trained weights from ImageNet and extract feature then using that weights as VGG16/19 will generalize lower level feature in its initial layers and last few layers are only used for classification purpose.
So basically pretrained model can be used for unsupervised and feature extraction purpose without retraining.
I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.
I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.
I am confused with what a linear chain CRF implementation exactly is. While some people say that "The Linear Chain CRF restricts the features to depend on only the current(i) and previous label(i-1), rather than arbitrary labels throughout the sentence" , some people say that it restricts the features to depend on the current(i) and future label(i+1).
I am trying to understand the implementation that goes behind the Stanford NER Model. Can someone please explain what exactly the linear chain CRF Model is?
Both models would be linear chain CRF models. The important part about the "linear chain" is that the features depend only on the current label and one direct neighbour in the sequence. Usually this would be the previous label (because that corresponds with reading order), but it could also be the future label. Such a model model would basically process the sentence backwards, and I have never seen this in the literature, but it would still be a linear chain CRF).
As far as I know, the Stanford NER model is based on a model that uses the current and the previous label, but it also uses an extension that can also look to labels further back. It is therefore not a strict linear-chain model, but uses an extension described in this paper:
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43nd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf