Multiple entity recognition with spaCy: Python error

I am stuck on a problem and would appreciate some help. I am trying to train multiple entity types using spaCy.
Here is my training data:
TRAIN_DATA = [
    ('java developer with java and html css javascript ',
     {'entities': [(0, 14, 'jobtitle'),
                   (0, 4, 'skills'),
                   (34, 37, 'skills'),
                   (38, 49, 'skills')]}),
    ('looking for software engineer with java python',
     {'entities': [(12, 29, 'jobtitle'),
                   (40, 46, 'skills'),
                   (35, 39, 'skills')]}),
]
Here is the training code that raises the error:
import random
import spacy

nlp = spacy.blank("en")
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
Error :
ValueError: [E103] Trying to set conflicting doc.ents: '(0, 14, 'jobtitle')' and '(0, 4, 'skills')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

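As the error message explains, spaCy's NER model does not support overlapping entity spans, so you can't train a model on these annotations as they are. Here (0, 14, 'jobtitle') and (0, 4, 'skills') both cover the token "java", and a token can belong to only one entity.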
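One way to make such annotations trainable is to drop overlapping spans before training. The sketch below is plain Python and only one possible policy: it keeps the longest span and discards anything that overlaps it. The helper name `drop_overlapping` is hypothetical, and which span to keep is a modelling decision that depends on your task.

```python
def drop_overlapping(entities):
    """Keep the longest spans; drop any span that overlaps a kept one."""
    # Visit the longest spans first; ties go to the earlier start offset.
    ordered = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ordered:
        # Two half-open spans are disjoint if one ends before the other starts.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Entities from the first training example in the question.
entities = [(0, 14, "jobtitle"), (0, 4, "skills"),
            (34, 37, "skills"), (38, 49, "skills")]
cleaned = drop_overlapping(entities)
print(cleaned)
```

With this policy the (0, 4, 'skills') span is removed because it sits inside (0, 14, 'jobtitle'). spaCy itself ships `spacy.util.filter_spans`, which applies a similar longest-span preference to `Span` objects, so that may be a better fit if you already have a `Doc`.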

Related

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with the NER model, as in the tutorial from huggingface (it contains only the training+evaluation part).
I am following this exact tutorial here : https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but the problems that I have begin when I try to predict on a simple sample.
from transformers import AutoModel, AutoTokenizer

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)
input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
Predictions is of shape (1,13,768)
How can I get a final result of the form [John <-> 'B-PER', ..., France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels, the tags for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However I would have expected to get the tensor representing the ground truth [0…8] labels from the ground truth annotations.
Summary when loading the model :
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}
The answer is a bit trickier than expected (huge credit to Niels Rogge).
First, models in huggingface-transformers can be loaded in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
Different tasks need different AutoModel classes. AutoModel loads only the base model without a task head, so it returns hidden states of size 768 per token rather than label scores, which is why the argmax above yields values far outside the 0 to 8 label range. For the scenario I posted, AutoModelForTokenClassification has to be used.
After that, the predictions can be obtained as follows:
# `model` is the AutoModelForTokenClassification instance,
# `encoding` is the tokenizer output (with return_tensors='pt')
# forward pass
outputs = model(**encoding)
logits = outputs.logits
predictions = logits.argmax(-1)
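To get from predicted label ids to the [token <-> label] pairs asked for in the question, each id still has to be looked up in a label map. The sketch below shows that step in plain Python on made-up numbers: the `logits`, `tokens`, and three-entry `id2label` are all hypothetical stand-ins (the real model above has nine labels), and `decode_predictions` is an illustrative helper, not part of transformers.

```python
def decode_predictions(tokens, logits, id2label):
    """Pair each token with the label whose logit is largest."""
    pairs = []
    for token, scores in zip(tokens, logits):
        # argmax over the label dimension for this token
        best_id = max(range(len(scores)), key=lambda i: scores[i])
        pairs.append((token, id2label[best_id]))
    return pairs


# Hypothetical values standing in for real model output.
id2label = {0: "O", 1: "B-PER", 2: "B-LOC"}
tokens = ["john", "lives", "in", "france"]
logits = [
    [0.1, 2.3, 0.2],  # highest score at index 1
    [1.9, 0.1, 0.3],  # highest score at index 0
    [2.2, 0.0, 0.1],  # highest score at index 0
    [0.2, 0.4, 3.1],  # highest score at index 2
]
decoded = decode_predictions(tokens, logits, id2label)
print(decoded)
```

With a real model, the tokens would typically come from `tokenizer.convert_ids_to_tokens(...)` and the label map from `model.config.id2label`. Note that the config shown in the question only contains generic names (LABEL_0 to LABEL_8), so the real tag names would need to be supplied separately or set when the model is created for training.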

How can I use Example with nlp.update in spaCy 3.0?

I am trying to train on my data with spaCy v3.0, and apparently nlp.update no longer accepts tuples. Here is the code:
import spacy
import random
import json
nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe('ner')
ner.add_label("label")
# Start the training
nlp.begin_training()
# Loop for 40 iterations
for itn in range(40):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}
    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]
        # Update the model
        nlp.update(texts, annotations, losses=losses, drop=0.3)
    print(losses)
and I am receiving this error:
ValueError Traceback (most recent call last)
<ipython-input-79-27d69961629b> in <module>
18 annotations = [entities for text, entities in batch]
19 # Update the model
---> 20 nlp.update(texts, annotations, losses=losses, drop=0.3)
21 print(losses)
~\Anaconda3\lib\site-packages\spacy\language.py in update(self, examples, _, drop, sgd, losses, component_cfg, exclude)
1086 """
1087 if _ is not None:
-> 1088 raise ValueError(Errors.E989)
1089 if losses is None:
1090 losses = {}
ValueError: [E989] `nlp.update()` was called with two positional arguments. This may be due to a backwards-incompatible change to the format of the training data in spaCy 3.0 onwards. The 'update' function should now be called with a batch of Example objects, instead of `(text, annotation)` tuples.
I build my training data like this:
TRAINING_DATA = []
for entry in labeled_data:
    entities = []
    for e in entry['labels']:
        entities.append((e[0], e[1], e[2]))
    spacy_entry = (entry['text'], {"entities": entities})
    TRAINING_DATA.append(spacy_entry)
My train data looks like this:
[('Part List', {'entities': []}),
 ('pending', {'entities': []}),
 ('3D Printing', {'entities': [(0, 11, 'Process')]}),
 ('Recommended to use a FDM 3D printer with PLA material.', {'entities': [(25, 36, 'Process'), (41, 44, 'Material')]}),
 ('', {'entities': []}),
 ('No need supports or rafts.', {'entities': []}),
 ('Resolution: 0.20mm', {'entities': []}),
 ('Fill density 20%', {'entities': []}),
 ('As follows from the analysis, part of the project is devoted to 3D', {'entities': [(64, 66, 'Process')]}),
 ('printing, as all static components were created using 3D modelling and', {'entities': [(54, 66, 'Process')]}),
 ('subsequent printing.', {'entities': []}),
 ('', {'entities': []}),
 ('In our project, we created several versions of the', {'entities': []}),
 ('model during modelling, which we will describe and document in the', {'entities': []}),
 ('following subchapters. As a tool for 3D modelling, we used the Sketchup', {'entities': [(37, 49, 'Process')]}),
 ('Make tool, version from 2017. The main reason was the high degree of', {'entities': []}),
 ('intuitiveness and simplicity of the tool, as we had not encountered 3D', {'entities': [(68, 70, 'Process')]}),
 ('modelling before and needed a relatively flexible and efficient tool to', {'entities': []}),
 ('guarantee the desired result. with zero previous experience.', {'entities': []}),
 ('In this version, which is shown in the figures Figure 13 - Version no. 2 side view and Figure 24 - Version no. 2 - front view, for the first time, the specific dimensions of the infuser were clarified and', {'entities': []}),
 ('modelled. The details of the lower servo attachment, the cable hole in', {'entities': []}),
 ('the main mast, the winding cylinder mounting, the protrusion on the', {'entities': [(36, 44, 'Process')]}),
 ('winding cylinder for holding the tea bag, the preparation for fitting', {'entities': []}),
 ('the wooden and aluminium plate and the shape of the cylinder end that', {'entities': [(15, 25, 'Material')]}),
 ('exactly fit the servo were also reworked.', {'entities': []}),
 ('After the creation of this', {'entities': []}),
 ('version of the model, this model was subsequently officially consulted', {'entities': []}),
 ('and commented on for the first time.', {'entities': []}),
 ('In this version, which is shown in the figures Figure 13 - Version no. 2 side view and Figure 24 - Version no. 2 - front view, for the first time, the specific dimensions of the infuser were clarified and', {'entities': []}),
 ('modelled. The details of the lower servo attachment, the cable hole in', {'entities': []}),
 ('the main mast, the winding cylinder mounting, the protrusion on the', {'entities': [(36, 44, 'Process')]})]
I would appreciate your help as a new contributor. Thanks a lot!
You didn't provide your TRAIN_DATA, so I cannot reproduce it. However, you should try something like this:
from spacy.training.example import Example
for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
    for text, annotations in batch:
        # create Example
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        # Update the model
        nlp.update([example], losses=losses, drop=0.3)
for batch in batches:
    texts, annotations = zip(*batch)
    example = []
    # Build one Example per text in the batch
    for i in range(len(texts)):
        doc = nlp.make_doc(texts[i])
        example.append(Example.from_dict(doc, annotations[i]))
    # Update the model
    nlp.update(example, drop=0.5, losses=losses)
This code runs successfully with spaCy 3.
Note that here each batch gives a tuple of texts; if you pass a single text at a time, you don't need the inner for loop.
Since version 3.0, spaCy has moved from the older "simple training style" to Example objects.
from spacy.training import Example
example = Example.from_dict(nlp.make_doc(text), annotations)
nlp.update([example])
See the training guide on the official spaCy website:
https://spacy.io/usage/training

Why does the sklearn LDA topic model always choose the model with the fewest topics?

I am doing topic modeling on text data (around 4000 news articles) using the sklearn LDA model, with GridSearchCV to choose the best model. However, in almost all cases, GridSearchCV picks the smallest number of topics as the best model.
For example 1:
# Define Search Param
search_params = {'n_components': [5, 7, 10, 12, 15, 18, 20], 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LatentDirichletAllocation()
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)
The best model is suggested: 5
Example 2:
# Define Search Param
search_params = {'n_components': [3, 5, 7, 10, 12, 15, 18], 'learning_decay': [.5, .7, .9]}
# Init the Model
lda = LatentDirichletAllocation()
# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)
# Do the Grid Search
model.fit(data_vectorized)
The best model is suggested: 3
Is this normal, or is it happening only to me?
What could be the reason for this?
The full code is long, so I have not included it here, but I can provide it if required.
Thanks in advance.
I'd say it is simply that for your data, three topics is a better topic distribution than five topics. You didn't give the model a chance to test if three topics was any good in the first set of tests. So the answer you got was that of the choices [5, 7, 10, 12, 15, 18, 20] then 5 is best.
The problem may be that your dataset is too small, so the model can't learn enough about its underlying topics.

SPACY custom NER is not returning any entity

I am trying to train a Spacy model to recognize a few custom NERs, the training data is given below, it is mostly related to recognizing a few server models, date in the FY format and Types of HDD:
TRAIN_DATA = [('Send me the number of units shipped in FY21 for A566TY server', {'entities': [(39, 42, 'DateParse'),(48,53,'server')]}),
('Send me the number of units shipped in FY-21 for A5890Y server', {'entities': [(39, 43, 'DateParse'),(49,53,'server')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total revenue in FY20Q2 for 3.5 HDD', {'entities': [(17, 22, 'DateParse'),(28,30,'HDD')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total units shipped in FY2017-FY2021', {'entities': [(23, 28, 'DateParse'),(30,35,'DateParse')]}),
('Total units shipped in FY 18', {'entities': [(23, 27, 'DateParse')]}),
('Total units shipped between FY16 and FY2021', {'entities': [(28, 31, 'DateParse'),(37,42,'DateParse')]})
]
import random
import spacy
from spacy.util import minibatch, compounding

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    return nlp
But when I run the code, no entities are returned, even on the training data.
prdnlp = train_spacy(TRAIN_DATA, 100)
for text, _ in TRAIN_DATA:
    doc = prdnlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
The Output is coming as below:
spaCy can currently only train on entity annotations that line up with token boundaries. The main problem is that your span end offsets are one character too short. The character start/end values should work exactly like string slices of the text:
text = "Send me the number of units shipped in FY21 for A566TY server"
# (39, 42, 'DateParse')
assert text[39:42] == "FY2"
You should have (39, 43, 'DateParse') instead.
A secondary problem is that you may also need to adjust the tokenizer for cases like FY2017-FY2021 because the default English tokenizer treats this as one token, so the annotations [(23, 28, 'DateParse'),(30,35,'DateParse')] would be ignored during training.
See a more detailed explanation here: https://github.com/explosion/spaCy/issues/4946#issuecomment-580663925
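A quick way to catch the off-by-one problem described above is to slice the text with each annotation and look at what comes out. The sketch below is plain Python; `check_offsets` is a hypothetical helper, and its word-boundary test is only a heuristic based on neighbouring characters, not spaCy's tokenizer. It would not catch the FY2017-FY2021 case, where the spans look fine as slices but still fall inside a single token.

```python
def check_offsets(text, entities):
    """Return (span_text, label, problem) for spans that look misaligned."""
    problems = []
    for start, end, label in entities:
        span = text[start:end]
        if span != span.strip():
            problems.append((span, label, "leading/trailing whitespace"))
        elif end < len(text) and text[end].isalnum() and span[-1:].isalnum():
            # The character right after the span continues the same word.
            problems.append((span, label, "ends mid-word"))
        elif start > 0 and text[start - 1].isalnum() and span[:1].isalnum():
            # The character right before the span belongs to the same word.
            problems.append((span, label, "starts mid-word"))
    return problems


text = "Send me the number of units shipped in FY21 for A566TY server"
# Offsets as given in the question, then with each end moved one to the right.
bad = check_offsets(text, [(39, 42, "DateParse"), (48, 53, "server")])
good = check_offsets(text, [(39, 43, "DateParse"), (48, 54, "server")])
print(bad)
print(good)
```

For a check against the real tokenization, spaCy provides a helper that converts offsets to BILUO tags and marks misaligned spans with "-" (`spacy.gold.biluo_tags_from_offsets` in v2, `spacy.training.offsets_to_biluo_tags` in v3).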

How to add new training data to an existing spaCy model using Python

I'm new to spaCy and Python. Using the code below I have created a new customized model, but what I need is to add new training data to my existing (customized) model.
TRAIN_DATA = [
    ('Who is Kofi Annan?', {
        'entities': [(8, 18, 'people')]
    }),
    ('Who is Steve Jobs?', {
        'entities': [(7, 17, 'people')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'location'), (18, 24, 'location')]
    })
]
nlp = spacy.blank('en') # create blank Language class
print("Created blank 'en' model")
The code above creates a customized model, but what I need is to add new training data to an existing model.
To update an existing model you just need to load that model instead of the blank model and start from there:
nlp = spacy.load('en')
There are a few things to be aware of, so take a look at the usage guide here: https://spacy.io/usage/training#example-train-ner
