I'm trying to figure out how to properly mark up data for a Relation Extraction task (I'm going to use an LSTM model).
So far, I have figured out that entities are marked with the
<e1>, </e1>, <e2> and </e2>
tags, and that the class of the relation is indicated in a separate column.
But what should I do when, in one sentence, a single entity has relations (of the same type, or of different types) to two other entities at once?
An example is shown in the image.
Or when there are four entities in one sentence and two relations are defined?
I have two options. The first is to introduce new tags
<e3>, </e3>, <e4> and </e4>
and do multi-class classification. But I haven't seen it done anywhere. The second option is to make a copy of the sentence and split the relations across the copies that way.
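To make the second option concrete, here's roughly what I mean, loosely following the SemEval-2010 Task 8 style (the sentence and relation labels are made up):

```
# one copy of the sentence per relation, with only one entity pair tagged in each
<e1>Steve Jobs</e1> founded <e2>Apple</e2> in California.      Founder(e1,e2)
Steve Jobs founded <e1>Apple</e1> in <e2>California</e2>.      HeadquarteredIn(e1,e2)
```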
Can you please tell me how to do this markup?
Related
I am trying to build a relation extraction model via spaCy's Prodigy tool.
NOTE: ner.manual, ner.correct and rel.manual are all recipes provided by Prodigy.
(ner.manual, ner.correct) The first step involved annotating and training a NER model that can predict entities (this step is done and a model has been obtained).
The next step involves annotating the relations between the entities, and this step could be done in two different ways:
i. Label the entities and relations all from scratch
ii. Use the trained NER model to predict the entities in the UI tool and make corrections to it if needed (similar to ner.correct) and label the relations between the entities
The issue I am now facing is that whenever I use the trained model in the recipe's loop (rel.manual), no entities are predicted.
Could someone help me with this?
PS: There is no trailing whitespace issue; I cross-verified that.
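For context, the workflow looks roughly like this; dataset, file, and label names are placeholders, and the exact recipe arguments may differ between Prodigy versions, so treat it as a sketch rather than exact commands:

```
# 1. Annotate entities from scratch and train the NER model
prodigy ner.manual my_ner_data blank:en ./texts.jsonl --label ENTITY_A,ENTITY_B
prodigy train ./ner_model --ner my_ner_data

# 2. Annotate relations, with entity spans pre-filled by the trained model
#    (--add-ents is my understanding of how model-predicted entities get added)
prodigy rel.manual my_rel_data ./ner_model/model-best ./texts.jsonl \
    --label RELATED_TO --span-label ENTITY_A,ENTITY_B --add-ents
```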
This is a question regarding training models with spaCy 3.x.
I couldn't find a good answer/solution on StackOverflow hence the query.
I am using an existing spaCy model, like the en model, and want to add my own entities and train it. Since I work in the biomedical domain, these are things like virus name, shape, length, temperature, temperature value, etc. At the same time, I don't want to lose the entities spaCy already tags, like organization names, countries, etc.
All suggestions are appreciated.
Thanks
There are a few ways to do that.
The best way is to train your own model separately and then combine both models in one pipeline, with one before the other. See the double NER example project for an overview of that.
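For illustration, a minimal sketch of that combination in spaCy 3 (the model paths and component name are placeholders; the double NER project covers the details, including how the two components share doc.ents):

```python
import spacy

# Pretrained pipeline with the general-purpose entities (ORG, GPE, ...)
nlp = spacy.load("en_core_web_sm")

# Separately trained model with the new domain-specific entities
custom = spacy.load("./my_custom_ner")  # placeholder path

# Add the custom NER as a second component, sourced from that pipeline.
# Which component runs first decides who gets priority when spans overlap in doc.ents.
nlp.add_pipe("ner", source=custom, name="custom_ner", before="ner")

doc = nlp("The H1N1 sample was shipped to Acme Corp in Berlin.")
print([(ent.text, ent.label_) for ent in doc.ents])
```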
It's also possible to update the pretrained NER model, see this example project. However this isn't usually a good idea, and definitely not if you're adding completely different entities. You'll run into what's called "catastrophic forgetting", where even though you're technically updating the model, it ends up forgetting everything not represented in your current training data.
I have trained a model using Google AutoML Natural Language - Entity Extraction. So far I have trained this model to extract a single keyword under one entity from text; however, I want to tag a single keyword under two entities to create a hierarchy. Example: currently the keyword "Lazada" is tagged under "Lazada_Ecommerce", but I want to tag this single keyword under two entities: the sub-entity "Lazada" and the main entity "Ecommerce". It would be a great help if someone could suggest whether this is possible with the Google AutoML Entity Extraction model, and how.
Thanks,
Satish Kumar
Data Scientist
Google NLP Entity Extraction does not support entity hierarchies. The result of a prediction includes an array of entities, corresponding to each detected entity in the text.
The PredictResponse (https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#google.cloud.automl.v1.PredictResponse) includes a 'payload' property, which is an array of AnnotationPayload objects (https://cloud.google.com/automl/docs/reference/rpc/google.cloud.automl.v1#google.cloud.automl.v1.AnnotationPayload).
Note: If a "sub-entity" can only have one "main entity", then you could manage entity hierarchies external to the model, i.e., train the model to predict "Lazada" and other sub-entities, and externally identify that "Lazada" and others belong to a main "Ecommerce" category. However, if your entity model could have a "Lazada" entity underneath multiple main entities then your current solution would be appropriate (e.g., "Lazada_Ecommerce", "Lazada_SomeOtherMainEntity", etc.).
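To illustrate the external-hierarchy idea, here's a minimal sketch; the mapping and the shape of the prediction output are assumptions, not the exact AutoML response format:

```python
# Hypothetical mapping from model-predicted sub-entities to main entities,
# maintained outside the AutoML model.
MAIN_ENTITY = {
    "Lazada": "Ecommerce",
    "Shopify": "Ecommerce",  # made-up additional example
}

def attach_main_entity(predicted_labels):
    """Pair each predicted sub-entity label with its main entity."""
    return [
        {"sub_entity": label, "main_entity": MAIN_ENTITY.get(label)}
        for label in predicted_labels
    ]

# e.g. labels pulled out of the prediction payload's display names
print(attach_main_entity(["Lazada"]))
# [{'sub_entity': 'Lazada', 'main_entity': 'Ecommerce'}]
```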
I added a new entity called "orgName" to en_core_web_lg using https://spacy.io/usage/training#example-new-entity-type
All my training data (26k sentences) have the "orgName" labeled in them.
To deal with the catastrophic forgetting problem, I ran en_core_web_lg on those 26k raw sentences and added the ORG, PROD, FAC, etc. entities as labels, and to avoid colliding entities, I created duplicates.
So, for a sentence A which was labeled with "orgName", I created a duplicate A2 which has ORG, PROD, FAC, etc., ending up with about 52k sentences.
I trained using 100 iterations.
Now, the problem is that when testing the model, even on the training sentences, it doesn't show ORG, PROD, FAC, etc., but only "orgName".
Where do you think the problem is?
In principle the way you're trying to solve the catastrophic forgetting problem, by retraining it on its old predictions, seems like a good approach to me.
However, if you have duplicate versions of the same sentence, annotated differently, and feed that to the NER classifier, you may confuse the model. The reason is that it doesn't just look at the positive examples, but also explicitly treats non-annotated words as negative cases.
So if you have "Bob lives in London", and you only annotate "London", then it will think Bob is surely not an NE. If then you have a second sentence where you annotate only Bob, it will "unlearn" that London is an NE, because now it's not annotated as such. So consistency really is important.
I would suggest implementing a more advanced algorithm to resolve the conflicts.
One option is to always just take the annotated entity with the longest Span. But if the Spans are often exactly the same, you may need to reconsider your label scheme. Which entities collide most often? I would assume ORG and orgName? Do you really need ORG? Perhaps the two can be "merged" as the same entity?
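For illustration, a minimal sketch of that longest-span rule, assuming each annotation is a (start, end, label) character span, so the two differently annotated copies of a sentence could be merged instead of kept as duplicates:

```python
def merge_annotations(ents_a, ents_b):
    """Merge two entity lists for one sentence; each entity is (start, end, label).
    On overlap, the longer span wins."""
    merged = []
    for ent in sorted(ents_a + ents_b, key=lambda e: e[1] - e[0], reverse=True):
        start, end, _ = ent
        # keep the entity only if it doesn't overlap an already accepted span
        if all(end <= s or start >= e for s, e, _ in merged):
            merged.append(ent)
    return sorted(merged)

# e.g. the custom "orgName" annotation vs. what en_core_web_lg predicted
custom = [(0, 14, "orgName")]
pretrained = [(0, 9, "ORG"), (24, 30, "GPE")]
print(merge_annotations(custom, pretrained))
# [(0, 14, 'orgName'), (24, 30, 'GPE')]
```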
I get the concept of distant supervision. As far as I understand, the process of creating training data is like this:
Extract named entities from sentences
Take two entities, e1 and e2, from each sentence.
Search for these two entities in a knowledge base (Freebase etc.) to find the relationship between them.
I got confused at this step. What if there is more than one relation between these two entities (e1 and e2)? If so, which relation should I select?
It depends on the model you're training.
Are you learning a model for one kind of relationship and doing bootstrapping? Then only pay attention to that one relationship and drop the others from your DB.
Are you trying to learn a bunch of relationships? Then use the presence or absence of each as a feature in your model. This is how Universal Schemas work.
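For instance, a rough sketch of the "presence or absence as a feature" idea, with made-up entity pairs and relation names:

```python
# Made-up relation inventory and KB facts, just to illustrate the feature idea
RELATIONS = ["born_in", "employed_by", "lives_in", "founder_of"]

kb_facts = {
    ("Bob", "London"): {"born_in", "lives_in"},
    ("Alice", "Acme"): {"employed_by"},
}

def pair_features(pair):
    """One binary feature per relation: 1 if the KB asserts it for this entity pair."""
    observed = kb_facts.get(pair, set())
    return [int(rel in observed) for rel in RELATIONS]

print(pair_features(("Bob", "London")))  # [1, 0, 1, 0]
```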
Here's an image of a feature matrix from the Universal Schema paper: