Training spaCy TextCategorizer with data that belongs to no label? - nlp

I'm gathering training data for multilabel classification. Some of the data fed into this project will not have enough information to assign it to one of the labels. If I train the model with data that belongs to no label, will it avoid labelling new data that is unclear? Do I need to train it with an "Unclear" label or should I just leave this type of data unlabelled?
I can't seem to find the answer to this question in the spaCy docs.

Assuming you really want multilabel classification, i.e. an instance can have zero or multiple classes, then it's fine to have some data without any label. If the model performs correctly, it should also predict no label for similar instances. Be careful, however: to the model, "no label" doesn't mean "unclear", it means that none of the possible classes apply (they are considered independently).
Note that in the case of multiclass classification, i.e. an instance always has exactly one class, it is impossible to assign no label to an instance. But it would also be suboptimal to create a class 'unclear', because in multiclass classification the model predicts the most likely class, i.e. relative to the others. Semantically, 'no label' is not a regular label comparable to the others.
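For reference, here is a minimal sketch of what this looks like with spaCy 3.x's textcat_multilabel component (the label names and texts are invented): an "unclear" instance is simply annotated with 0.0 for every label.

# Minimal multilabel training sketch, assuming spaCy 3.x; labels/texts are made up.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in ("BILLING", "SHIPPING"):
    textcat.add_label(label)

train_data = [
    ("I was charged twice for my order", {"cats": {"BILLING": 1.0, "SHIPPING": 0.0}}),
    ("The package never arrived", {"cats": {"BILLING": 0.0, "SHIPPING": 1.0}}),
    # an unclear example: every label is 0.0, i.e. none of the classes apply
    ("Hello, just checking in", {"cats": {"BILLING": 0.0, "SHIPPING": 0.0}}),
]

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]
nlp.initialize(lambda: examples)
for _ in range(10):
    nlp.update(examples)

# Each label gets an independent score; "no label" just means every score
# stays below whatever threshold you choose.
print(nlp("random small talk").cats)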
Technically this is not a programming question (for future reference, better ask such questions on https://datascience.stackexchange.com/ or https://stats.stackexchange.com/).

Related

pos_weight in multilabel classification in pytorch

I am using PyTorch for multilabel classification. I have used pos_weight in BCELoss since I have imbalanced data. To use pos_weight, do we need to take the entire dataset (train, validation, test) or only the training set for calculating the pos_weight? Thanks...
While not a coding question and better suited for a different SE site, the quick answer is this:
You always assume you have never seen the test set before, so you cannot use it in any way to make decisions about the model design. For the validation set, a similar argument can be made in that you want to validate at regular intervals using unseen data. As such, you want to calculate class weights using the train data only.
Do keep in mind that if the class distribution in your training data is not representative of the class distribution in unseen data (i.e. the real world, or your test set), then the model will optimize for the wrong class distribution. This should be solved by analyzing the task better, not by directly using the test set to determine the class distribution.
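As a rough sketch of the above (the tensor contents and variable names are made up), you would compute the per-class weights from the training labels only and pass them to the loss. Note that in PyTorch, pos_weight is actually an argument of BCEWithLogitsLoss:

# Sketch: per-class pos_weight computed from *training* labels only.
import torch

# train_labels: float tensor of shape (num_train_samples, num_classes), 0/1 entries
train_labels = torch.tensor([[1., 0., 0.],
                             [0., 1., 1.],
                             [1., 0., 1.],
                             [0., 0., 1.]])

pos_counts = train_labels.sum(dim=0)                 # positives per class
neg_counts = train_labels.shape[0] - pos_counts      # negatives per class
pos_weight = neg_counts / pos_counts.clamp(min=1.0)  # guard against empty classes

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# later, inside the training loop:
# loss = criterion(model(inputs), targets)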

How can/should we weight classes in HuggingFace token classification (entity recognition)?

I'm training a token classification (AKA named entity recognition) model with the HuggingFace Transformers library, with a customized data loader.
Like most NER datasets (I'd imagine?) there's a pretty significant class imbalance: A large majority of tokens are other - i.e. not an entity - and of course there's a little variation between the different entity classes themselves.
As we might expect, my "accuracy" metrics are getting distorted quite a lot by this: It's no great achievement to get 80% token classification accuracy if 90% of your tokens are other... A trivial model could have done better!
I can calculate some additional and more insightful evaluation metrics - but it got me wondering... Can/should we somehow incorporate these weights into the training loss? How would this be done using a typical *ForTokenClassification model e.g. BERTForTokenClassification?
This is actually a really interesting question, since it seems there is no intended way (yet) to modify the losses inside the models yourself. Specifically for BertForTokenClassification, I found this code segment:
# Inside BertForTokenClassification.forward(), when labels are provided:
loss_fct = CrossEntropyLoss()  # unweighted; no class-weight argument is exposed
# ...
# logits flattened from (batch, seq_len, num_labels), labels from (batch, seq_len)
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
To actually change the loss computation and add other parameters, e.g., the weights you mention, you can go about either one of two ways:
You can modify a copy of transformers locally, and install the library from there, which makes this only a small change in the code, but potentially quite a hassle to change parts during different experiments, or
You return your logits (which is the case by default), and calculate your own loss outside of the actual forward pass of the huggingface model. In this case, you need to be aware of any potential propagation from the loss calculated within the forward call, but this should be within your power to change.
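Here is a hedged sketch of the second option (the class weights, label count and batch keys below are illustrative assumptions): call the model without labels so it does not compute its own loss, then apply a weighted CrossEntropyLoss to the returned logits.

# Sketch: custom class-weighted token-classification loss outside the model's forward().
import torch
from torch.nn import CrossEntropyLoss
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=5)
class_weights = torch.tensor([0.1, 1.0, 1.0, 1.0, 1.0])  # e.g. down-weight the "other" class

def training_step(batch):
    # batch["labels"] uses -100 for tokens that should be ignored (padding/special tokens)
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"])  # no labels -> no internal loss
    logits = outputs.logits                                  # (batch, seq_len, num_labels)
    loss_fct = CrossEntropyLoss(weight=class_weights, ignore_index=-100)
    loss = loss_fct(logits.view(-1, model.config.num_labels),
                    batch["labels"].view(-1))
    return loss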

Training Doc2vec with new data

I have a doc2vec model trained on documents with labels. I'm trying to continue training my model with model.train(). The new data comes with new labels as well, but, when I train it on more documents, the new labels aren't being recorded... Does anyone know what my problem might be?
Gensim's Doc2Vec only learns its set of tags at the same time it learns the corpus vocabulary of unique words – during the first call to .build_vocab() on the original corpus.
When you train with additional examples that have either words or tags that aren't already known to the model, those words or tags are simply ignored.
(The .build_vocab(…, update=True) option that's available on Word2Vec to expand its vocabulary has never been fully adapted to Doc2Vec: it doesn't handle new tags, and it has a longstanding crashing bug. So it's not supported on Doc2Vec.)
Note that if it is your aim to create document-vectors that assist in some downstream-classification task, you may not want to supply your known-labels as tags, or at least not as a document's only tag.
The tags you supply to Doc2Vec are the units for which it learns vectors. If you have a million text examples but only 5 different labels, and you feed those million examples into training each with only the label as a tag, the model is only learning 5 doc-vectors. It is, essentially, like you're training on only 5 mega-documents (passed in, in chunks) – and thus 'summarizing' each label down to a single point in vector-space, when it might be far more useful to think of a label as covering an irregularly-shaped "point cloud".
So, you might instead want to use document-IDs rather than labels. (Or, labels and document-IDs.) Then, use the many varied vectors from all individual documents – rather than single vectors per label – to train some downstream classifier or clusterer.
And in that case, the arrival of documents with new labels might not require a full Doc2Vec-retraining. Instead, if the new documents still get useful vectors from inference on the older Doc2Vec model, those per-doc vectors may reflect enough about the new label's documents that downstream classifiers can learn to recognize them.
Ultimately, though, if you acquire much more training data, reflecting all new vocabularies & word-senses, the safest approach is to retrain a Doc2Vec model from scratch, using all data. Simple incremental training, even if it had official support, risks pulling those words/tags that appear in new data arbitrarily out-of-comparable-alignment with words/tags that were only trained in the original dataset. It is the interleaved co-training, alongside all other examples equally, which pushes-and-pulls all vectors in a model into useful relative arrangements.
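For illustration, here is a small sketch of the document-ID approach described above, assuming Gensim 4.x (the texts, IDs and parameters are invented): each document gets its own tag, and new documents are handled by inference rather than by further training.

# Sketch: per-document tags plus inference for new documents (Gensim 4.x assumed).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    ("doc_0", "the committee debated the new environment bill".split()),
    ("doc_1", "the budget resolution passed after a long debate".split()),
    ("doc_2", "farmers discussed water rights and irrigation".split()),
]
corpus = [TaggedDocument(words=words, tags=[doc_id]) for doc_id, words in raw_docs]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)  # tags and vocabulary are fixed at this point
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# One learned vector per document ID, usable as features for a downstream classifier:
train_vectors = [model.dv[doc_id] for doc_id, _ in raw_docs]

# A new document (possibly with a new label) is inferred, not trained:
new_vector = model.infer_vector("senators argued about clean water funding".split())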

Is it possible to supplement Naive Bayes text classification algorithm with author information?

I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with speaker's observed prior speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything; simply add this information to your Naive Bayes and it will work just fine.
And as was previously mentioned in the comments, do not change any priors: the prior probability is P(class), which has nothing to do with the actual features.
Just add to your computations another feature corresponding to the authorship, e.g. "author:AUTHOR", and train Naive Bayes as usual, i.e. compute P(class|author:AUTHOR) for each class and AUTHOR and use it later on in your classification process. If your current representation is a bag of words, it is sufficient to add an "artificial" word of the form "author:AUTHOR" to it.
One other option would be to train an independent classifier for each AUTHOR, which would capture person-specific patterns of speech. For example, one author might use the word "environment" only when talking about "nature", while another simply likes to add it to every speech ("Oh, in our local environment of ..."). Independent NBs would capture this kind of phenomenon.
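As a quick illustration of the "artificial word" approach with scikit-learn (the speeches, topics and author names are invented), the author is appended to the text as one extra bag-of-words token:

# Sketch: author identity as an extra bag-of-words feature for Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

speeches = [
    ("We must fund the new highway project", "SMITH", "Transportation"),
    ("This treaty strengthens our alliances abroad", "JONES", "Foreign Policy"),
    ("The highway bill wastes taxpayer money", "SMITH", "Transportation"),
]

texts = [f"{text} author:{author}" for text, author, _ in speeches]
labels = [topic for _, _, topic in speeches]

# permissive token_pattern so the "author:NAME" token survives tokenization
clf = make_pipeline(CountVectorizer(token_pattern=r"[^\s]+"), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["A speech about bridges and roads author:SMITH"]))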

Parameter tuning for 1-class classification with LibSVM in weka

I am doing 1-class classification with the LibSVM wrapper in Weka. But the problem is that during TESTING, even if I use the same TRAINING instances, I see most of them classified as outliers (NaN), which is unreasonable (how can this happen?). If this is something to do with parameter tuning, what parameters should I try tweaking?
A classifier needs at least two class values to "work". If all you have is labeled data with one label value (your one class value), then you need to get data that is not part of that class so that a classifier can function.
