Spacy 3.0 Training custom NER --> Validation of this custom NER model - python-3.x

I trained a custom SpaCy Named entity recognition model to detect biased words in job description. Now that I trained 8 variantions (using different base model, training model, and pipeline setting), I want to evaluate which model is performing best.
But.. I can't find any documentation on the validation of these models.
There are some numbers of recall, f1-score and precision on the meta.json file, in the output folder, but that is no sufficient.
Anyone knows how to validate or can link me to the correct documentation? The documentation seem nowhere to be found.
NOTE: Talking about SpaCy V3.x

During training you should provide "evaluation data" that can be used for validation. This will be evaluated periodically during training and appropriate scores will be printed.
Note that there's a lot of different terminology in use, but in spaCy there's "training data" that you actually train on and "evaluation data" which is not training and just used for scoring during the training process. To evaluate on held-out test data you can use the cli evaluate command.
Take a look at this fashion brands example project to see how "eval" data is configured and used.

Related

Hugging face transformer: model bio_ClinicalBERT not trained for any of the task?

This maybe the most beginner question of all :sweat:.
I just started learning about NLP and hugging face. The first thing I'm trying to do is to apply one the bioBERT models on some clinical note data and see what I do, before moving on to the fine-tuning the model. And it looks like "emilyalsentzer/Bio_ClinicalBERT" to be the closest model for my data.
But as I try to use it for any of the analyses I always get this warning.
Some weights of the model checkpoint at emilyalsentzer/Bio_ClinicalBERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
From the hugging face course chapter 2 I understand this meant.
This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.
So I went on to test which NLP task I can use "emilyalsentzer/Bio_ClinicalBERT" for, out of the box.
from transformers import pipeline, AutoModel
checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
nlp_task = ['conversational', 'feature-extraction', 'fill-mask', 'ner',
'question-answering', 'sentiment-analysis', 'text-classification',
'token-classification',
'zero-shot-classification' ]
for task in nlp_task:
print(task)
process = pipeline(task=task, model = checkpoint)
And I got the same warning message for all the NLP tasks, so it appears to me that I shouldn't/advised not to use the model for any of the tasks. This really confuses me. The original bio_clinicalBERT model paper stated that they had good results on a few different tasks. So certainly the model was trained for those tasks. I also have similar issue with other models as well, i.e. the blog or research papers said a model obtained good results with a specific task but when I tried to apply with pipeline it gives the warning message. Is there any reason why the head layers were not included in the model?
I only have a few hundreds clinical notes (also unannotated :frowning_face:), so it doesn't look like it's big enough for training. Is there any way I could use the model on my data without training?
Thank you for your time.
This Bio_ClinicalBERT model is trained for Masked Language Model (MLM) task. This task basically used for learning the semantic relation of the token in the language/domain. For downstream tasks, you can fine-tune the model's header with your small dataset, or you can use a fine-tuned model like Bio_ClinicalBERT-finetuned-medicalcondition which is the fine-tuned version of the same model. You can find all the fine-tuned models in HuggingFace by searching 'bio-clinicalBERT' as in the link.

Bert for relation extraction

i am working with bert for relation extraction from binary classification tsv file, it is the first time to use bert so there is some points i need to understand more?
how can i get an output like giving it a test data and show the classification results whether it is classified correctly or not?
how bert extract features of the sentences, and is there a method to know what are the features that is chosen?
i used once the hidden layers and another time i didn't use i got the accuracy of not using the hidden layer higher than using it, is there an reason for that?

How to fine tune BERT on unlabeled data?

I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need in the order of 1000 or more samples including labels.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain, by basically continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
If your ultimate objective is sentence embeddings, however, I would strongly suggest you to have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form ofloss that is relevant for embeddings.
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
#dennlinger gave an exhaustive answer. Additional pretraining is also referred as "post-training", "domain adaptation" and "language modeling fine-tuning". here you will find an example how to do it.
But, since you want to have good sentence embeddings, you better use Sentence Transformers. Moreover, they provide fine-tuned models, which already capable of understanding semantic similarity between sentences. "Continue Training on Other Data" section is what you want to further fine-tune the model on your domain. You do have to prepare training dataset, according to one of available loss functions. E.g. ContrastLoss requires a pair of texts and a label, whether this pair is similar.
I believe transfer learning is useful to train the model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer based on your own training data. However, the data would need to be labelled.
Tensorflow has some useful guide on transfer learning.
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training and for getting started, you can take a look over here.

load Doc2Vec model and get new sentence's vectors for test

I have read lots of examples regarding doc2vec, but I couldn't find any answer. Like a real example, I want to build a model with doc2vec and then train it with some ML models. after that, how can I get the vector of a raw string with the exact trained Doc2vec model? because I need to predict with my ML model with the same size and logical vector
There are a collection of example Jupyter (aka IPython) notebooks in the gensim docs/notebooks directory. You can view them online at:
https://github.com/RaRe-Technologies/gensim/tree/develop/docs/notebooks
But they'll be in your gensim installation directory, if you can find that for your current working environment.
Those that include doc2vec in their name demonstrate the use of the Doc2Vec class. The most basic intro operates on the 'Lee' corpus that's bundled with gensim for use in its unit tests. (It's really too small for real Doc2Vec success, but by forcing smaller models and many training iterations the notebook just barely manages to get some consistent results.) See:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
It includes a section on inferring a vector for a new text:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Note that inference is performed on a list of string tokens, not a raw string. And those tokens should have been preprocessed/tokenized the same way as the original training data for the model, so that the vocabularies are compatible. (Any unknown words in a new text are silently ignored.)
Note also that especially on short texts, it often helps to provide a much-larger-than-default value of the optional steps parameter to infer_vector() - say 50 or 200 rather than the default 5. It may also help to provide a starting alpha parameter more like the training default of 0.025 than the method-default of 0.1.

unary class text classification in weka?

I have a training dataset (text) for a particular category (say Cancer). I want to train a SVM classifier for this class in weka. But when i try to do this by creating a folder 'cancer' and putting all those training files to that folder and when i run to code i get the following error:
weka.classifiers.functions.SMO: Cannot handle unary class!
what I want to do is if the classifier finds a document related to 'cancer' it says the class name correctly and once i fed a non cancer document it should say something like 'unknown'.
What should I do to get this behavior?
The SMO algorithm in Weka only does binary classification between two classes. Sequential Minimal Optimization is a specific algorithm for solving an SVM and in Weka this a basic implementation of this algorithm. If you have some examples that are cancer and some that are not, then that would be binary, perhaps you haven't labeled them correctly.
However, if you are using training data which is all examples of cancer and you want it to tell you whether a future example fits the pattern or not, then you are attempting to do one-class SVM, aka outlier detection.
LibSVM in Weka can handle one-class svm. Unlike the Weka SMO implementation, LibSVM is a standalone program which has been interfaced into Weka and incorporates many different variants of SVM. This post on the Wekalist explains how to use LibSVM for this in Weka.

Resources