This is a question about training models with spaCy 3.x.
I couldn't find a good answer on Stack Overflow, hence the query.
Suppose I am using an existing spaCy model, like the en model, and want to add my own entities and train it. Since I work in the biomedical domain, these would be things like virus name, shape, length, temperature, and temperature value. I don't want to lose the entities spaCy already tags, like organization names, countries, etc.
All suggestions are appreciated.
Thanks
There are a few ways to do that.
The best way is to train your own model separately and then combine both models in one pipeline, with one before the other. See the double NER example project for an overview of that.
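A minimal sketch of that approach in spaCy 3, assuming your separately trained pipeline lives at the hypothetical path "my_biomedical_model" (the two pipelines should share a compatible vocab and vectors, otherwise spaCy will warn):

```python
import spacy

base_nlp = spacy.load("en_core_web_lg")
custom_nlp = spacy.load("my_biomedical_model")  # your own trained pipeline

# Source the custom NER into the base pipeline under a new name so it runs
# after the pretrained NER; entities already set by the first component are
# left in place, and the second one only labels the remaining spans.
base_nlp.add_pipe("ner", name="ner_biomedical", source=custom_nlp, after="ner")

doc = base_nlp("The H1N1 virus was studied at Stanford University.")
print([(ent.text, ent.label_) for ent in doc.ents])
```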
It's also possible to update the pretrained NER model; see this example project. However, this isn't usually a good idea, and definitely not if you're adding completely different entities. You'll run into what's called "catastrophic forgetting": even though you're technically only updating the model, it ends up forgetting everything not represented in your current training data.
Related
I am trying to use MLflow to manage my Bayesian optimization model, which has several methods besides predict() (run_optimization(), for example). My problem is that when I log the model to the tracking server and retrieve it, it only exposes predict(), since it is wrapped as a PyFuncModel. That's an issue because I also need the model to run prescriptions (suggesting a possible new optimum). Has anyone tried this? Thanks
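One possible workaround, sketched under the assumption of a recent MLflow (2.x): keep the extra behaviour on a mlflow.pyfunc.PythonModel subclass and recover the underlying object with unwrap_python_model() after loading. The BayesOptWrapper class and the model URI below are hypothetical.

```python
import mlflow.pyfunc

class BayesOptWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical wrapper around a Bayesian optimization object."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def predict(self, context, model_input):
        # Only this method is exposed through the PyFunc interface.
        return self.optimizer.predict(model_input)

    def run_optimization(self):
        # Extra method, invisible on the loaded PyFuncModel itself.
        return self.optimizer.run_optimization()

# After mlflow.pyfunc.log_model(..., python_model=BayesOptWrapper(opt)):
loaded = mlflow.pyfunc.load_model("runs:/<run_id>/model")  # placeholder URI
wrapper = loaded.unwrap_python_model()  # recovers the BayesOptWrapper
wrapper.run_optimization()
```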
I want to fine tune BERT on a specific domain. I have texts of that domain in text files. How can I use these to fine tune BERT?
I am looking here currently.
My main objective is to get sentence embeddings using BERT.
The important distinction to make here is whether you want to fine-tune your model, or whether you want to expose it to additional pretraining.
The former is simply a way to train BERT to adapt to a specific supervised task, for which you generally need on the order of 1,000 or more labeled samples.
Pretraining, on the other hand, is basically trying to help BERT better "understand" data from a certain domain by continuing its unsupervised training objective ([MASK]ing specific words and trying to predict what word should be there), for which you do not need labeled data.
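A rough sketch of that continued masked-language-model pretraining with Hugging Face transformers; the corpus path and hyperparameters are placeholders, and newer versions of the library prefer the datasets package over LineByLineTextDataset:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One sentence per line in your domain text file (placeholder path).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="domain_corpus.txt",
                                block_size=128)
# The collator randomly [MASK]s 15% of tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="bert-domain",
                                         num_train_epochs=1,
                                         per_device_train_batch_size=8),
                  data_collator=collator,
                  train_dataset=dataset)
trainer.train()
```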
If your ultimate objective is sentence embeddings, however, I would strongly suggest you have a look at Sentence Transformers, which is based on a slightly outdated version of Huggingface's transformers library, but primarily tries to generate high-quality embeddings. Note that there are ways to train with surrogate losses, where you try to emulate some form of loss that is relevant for embeddings.
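For instance, a minimal sketch of getting embeddings with sentence-transformers ("all-MiniLM-L6-v2" is one of its pretrained checkpoints):

```python
from sentence_transformers import SentenceTransformer

# encode() returns one fixed-size vector per input sentence.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The virus sample was collected yesterday.",
                           "Temperatures were measured at 37 degrees."])
print(embeddings.shape)  # (2, 384) for this model
```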
Edit: The author of Sentence-Transformers recently joined Huggingface, so I expect support to greatly improve over the upcoming months!
@dennlinger gave an exhaustive answer. Additional pretraining is also referred to as "post-training", "domain adaptation", and "language modeling fine-tuning". Here you will find an example of how to do it.
But since you want good sentence embeddings, you are better off using Sentence Transformers. Moreover, they provide fine-tuned models that are already capable of capturing semantic similarity between sentences. The "Continue Training on Other Data" section is what you want for further fine-tuning the model on your domain. You do have to prepare a training dataset according to one of the available loss functions. E.g. ContrastiveLoss requires a pair of texts and a label indicating whether the pair is similar.
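For instance, a fine-tuning sketch with ContrastiveLoss; the checkpoint name and the toy pairs are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint

# ContrastiveLoss expects text pairs with a 0/1 label: 1 = similar.
train_examples = [
    InputExample(texts=["virus shape", "viral morphology"], label=1),
    InputExample(texts=["virus shape", "quarterly earnings"], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```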
I believe transfer learning is useful for training a model on a specific domain. First you load the pretrained base model and freeze its weights, then you add another layer on top of the base model and train that layer on your own training data. However, the data would need to be labelled.
TensorFlow has a useful guide on transfer learning.
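A minimal Keras sketch of that freeze-and-add-a-head pattern; the TF Hub encoder URL and layer sizes are illustrative placeholders:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Pretrained text encoder from TF Hub; trainable=False freezes its weights.
base = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim50/2",
                      input_shape=[], dtype=tf.string, trainable=False)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(16, activation="relu"),    # new trainable head
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary label output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_texts, train_labels, epochs=5)  # labelled data required
```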
You are talking about pre-training. Fine-tuning on unlabeled data is called pre-training, and to get started you can take a look over here.
I added a new entity type called "orgName" to en_core_web_lg following https://spacy.io/usage/training#example-new-entity-type
All my training data (26k sentences) have the "orgName" labeled in them.
To deal with the catastrophic forgetting problem, I ran en_core_web_lg on those 26k raw sentences and added its ORG, PROD, FAC, etc. predictions as labels; to avoid colliding entities, I created duplicates.
So, for a sentence A that was labeled with "orgName", I created a duplicate A2 that has the ORG, PROD, FAC, etc. labels, ending up with about 52k sentences.
I trained using 100 iterations.
Now, the problem is that when testing the model, even on the training sentences, it doesn't show ORG, PROD, FAC, etc., but only "orgName".
Where do you think the problem is?
In principle, the way you're trying to solve the catastrophic forgetting problem, by retraining the model on its own old predictions, seems like a good approach to me.
However, if you have duplicate versions of the same sentence annotated differently and feed those to the NER classifier, you may confuse the model. The reason is that it doesn't just look at the positive examples; it also explicitly treats non-annotated words as negative cases.
So if you have "Bob lives in London" and you only annotate "London", the model will learn that Bob is surely not an entity. If you then have a second sentence where you annotate only "Bob", it will "unlearn" that London is an entity, because now it's not annotated as such. So consistency really is important.
I would suggest implementing a more advanced algorithm to resolve the conflicts, as sketched below.
One option is to always take the annotated entity with the longest span. But if the spans are often exactly the same, you may need to reconsider your label scheme. Which entities collide most often? I would assume ORG and orgName? Do you really need ORG? Perhaps the two can be merged as the same entity?
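A rough sketch of that longest-span resolution, assuming entities are represented as (start, end, label) tuples with exclusive end offsets:

```python
def merge_annotations(ents_a, ents_b):
    """Merge two annotation sets for one sentence; longer spans win overlaps."""
    merged = []
    # Consider longer spans first so they win any conflict.
    for start, end, label in sorted(ents_a + ents_b,
                                    key=lambda e: e[1] - e[0], reverse=True):
        # Keep a span only if it doesn't overlap anything already kept.
        if all(end <= s or start >= e for s, e, _ in merged):
            merged.append((start, end, label))
    return sorted(merged)

# Toy offsets: the same name labelled by both models; the longer
# orgName span is kept and the shorter ORG span is dropped.
print(merge_annotations([(0, 14, "orgName")], [(0, 9, "ORG")]))
# -> [(0, 14, 'orgName')]
```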
Could I ask about Stanford NER? I'm trying to train my own model to use later for learning. According to the documentation, I have to add my own features in SeqClassifierFlags and add code for each feature in NERFeatureFactory.
My question is: I have my tokens with all features already extracted, and the last column represents the label. Is there any way to give Stanford NER my tab-delimited file, which contains 30 columns (1 is the word, 28 are features, and 1 is the label), to train my own model without spending time on feature extraction?
Of course, in the testing phase, I will give it a file like the aforementioned one, without the label column, to predict the labels.
Is this possible or not?
Many thanks in advance
As explained in the FAQ page, the only way to customize the NER model is by inserting the data and specifying the features that you want to extract.
But wait... you have the data, and you have managed to extract the features, so I think you don't need the NER model; you need a classifier. I know this answer is pretty late, but maybe this classifier will be a good place to start.
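If you are open to Python instead of the Stanford Classifier, any generic classifier can consume the pre-extracted feature columns directly; a rough scikit-learn sketch, with the file name and column layout assumed from the question:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# "train.tsv" is the hypothetical 30-column file described above:
# column 0 = word, columns 1-28 = features, column 29 = label.
df = pd.read_csv("train.tsv", sep="\t", header=None)
X, y = df.iloc[:, 1:29], df.iloc[:, 29]

# One-hot encode the categorical feature columns, then fit a classifier.
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                    LogisticRegression(max_iter=1000))
clf.fit(X, y)
```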
I'm trying to get started with OpenNLP. I need it to extract new organizations (startups) from news websites (for example, TechCrunch). I have a model for organizations (en-ner-organization), which I use to recognize organizations in publications. And here I have a question:
In case there is a publication about a new startup that was founded yesterday, will OpenNLP recognize it as an organization? As far as I understand, no, not until I train the model with this new startup, right?
If all my assumptions are correct, the model partially consists of organization names, so if I want my model to recognize a new organization, I have to train it with its name.
Thanks
As far as I know, OpenNLP uses a statistical model for named entity recognition: this means that, if OpenNLP has been properly trained with enough data, it should be able to recognize new startups (it's not a grep of known tokens over a file).
Of course, metrics such as precision, recall, and F1 are useful for determining the accuracy of the algorithm.
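For reference, a toy sketch of how those span-level metrics are typically computed; the gold and predicted (start, end, label) spans below are hypothetical:

```python
# Exact-match span-level evaluation, the usual way NER P/R/F1 is scored.
gold = {(0, 10, "ORG"), (15, 25, "ORG"), (30, 42, "ORG")}
pred = {(0, 10, "ORG"), (15, 25, "ORG"), (50, 60, "ORG")}

tp = len(gold & pred)            # spans predicted exactly right
precision = tp / len(pred)       # fraction of predictions that are correct
recall = tp / len(gold)          # fraction of gold entities found
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
```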