How to merge entities in spaCy via rules

I want to use some of the entities in spaCy 3's en_core_web_lg, but relabel some of what it tags as 'ORG' to 'ANALYTIC', since it treats the 3-character codes I use, such as 'P&L' and 'VaR', as organizations. The model's DATE entities I'm fine to preserve. I've read all the docs, and it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I've been through the training 2-3 times now, read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, like needing a decorator, or other. Lord, is it really that hard?
my code:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_lg")

analytics = [
    [{'LOWER': 'risk'}],
    [{'LOWER': 'pnl'}],
    [{'LOWER': 'p&l'}],
    [{'LOWER': 'return'}],
    [{'LOWER': 'returns'}]
]

matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")
    # Overwrite doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)
This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or create my own custom pipeline using rules. Below is the suggested code from the EntityRuler docs. No clue.
# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)

So when you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can only be in at most one span. In your case you can fix this by modifying this line:
doc.ents = list(doc.ents) + [span]
Here you've included both the old span (that you don't want) and the new span. If you get doc.ents without the old span this will work.
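For example, here's a minimal sketch of that fix: drop any existing entity that overlaps one of the new spans before reassigning doc.ents. The model name, sample sentence, and the two patterns below are just stand-ins for your setup:
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", [[{"LOWER": "p&l"}], [{"LOWER": "var"}]])  # stand-in patterns

doc = nlp("P&L and VaR improved this quarter.")  # stand-in text
new_spans = [Span(doc, start, end, label="ANALYTIC")
             for match_id, start, end in matcher(doc)]

# Keep only the model's entities that don't overlap a new span,
# then assign the combined, non-conflicting list back to doc.ents.
covered = {i for span in new_spans for i in range(span.start, span.end)}
kept = [ent for ent in doc.ents if not any(i in covered for i in range(ent.start, ent.end))]
doc.ents = kept + new_spans

for ent in doc.ents:
    print(ent.label_, ent.text)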
There are also other ways to do this. Here I'll use a simplified example where you always want to change items of length 3, but you can modify this to use your list of specific words or something else.
You can directly modify entity labels, like this:
for ent in doc.ents:
    if len(ent.text) == 3:
        ent.label_ = "CHECK"
    print(ent.label_, ent, sep="\t")
If you want to use the EntityRuler it would look like this:
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
patterns = [
    {"label": "ANALYTIC", "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]}
]
ruler.add_patterns(patterns)

text = "P&L reported amazing returns this year."
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent, sep="\t")
One more thing - you don't say what version of spaCy you're using. I'm using spaCy v3 here. The way pipes are added changed a bit in v3.
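For reference, a rough side-by-side of the two styles (a sketch for comparison only; only the v3 form applies to the code above):
# spaCy v3: components are added by their registered string name,
# and add_pipe returns the component instance.
import spacy
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})

# spaCy v2 (for comparison): you construct the component yourself
# and pass the instance to add_pipe.
# from spacy.pipeline import EntityRuler
# ruler = EntityRuler(nlp, overwrite_ents=True)
# nlp.add_pipe(ruler)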

Related

Transformers get named entity prediction for words instead of tokens

This is a very basic question, but I spent hours struggling to find the answer. I built NER using Huggingface transformers.
Say I have the input sentence
input = "Damien Hirst oil in canvas"
I tokenize it to get
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
tokenized = tokenizer.encode(input) #[101, 12587, 7632, 12096, 3514, 1999, 10683, 102]
Feeding the tokenized sentence to the model gives predicted tags for the tokens:
['B-ARTIST' 'B-ARTIST' 'I-ARTIST' 'I-ARTIST' 'B-MEDIUM' 'I-MEDIUM'
'I-MEDIUM' 'B-ARTIST']
The prediction comes as output from the model; it assigns a tag to each token.
How can I recombine this data to obtain tags for words instead of tokens? So I would know that
"Damien Hirst" = ARTIST
"Oil in canvas" = MEDIUM
There are two questions here.
Annotating Token Classification
A common sequence-tagging scheme, especially in Named Entity Recognition, labels the first token of an entity of type X as B-X and the tokens inside that entity as I-X.
The problem is that most annotated datasets are tokenized on whitespace! For example:
[CLS] O
Damien B-ARTIST
Hirst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
[SEP] O
where O indicates a token that is not part of a named entity, B-ARTIST marks the beginning of a sequence of tokens labelled ARTIST, and I-ARTIST marks a token inside that sequence; the same pattern applies to MEDIUM.
At the time I posted this answer, there was an example of NER in the huggingface documentation here:
https://huggingface.co/transformers/usage.html#named-entity-recognition
The example doesn't exactly answer the question here, but it can add some clarification. A similar style of named-entity labels for this example would be:
label_list = [
    "O",         # not a named entity
    "B-ARTIST",  # beginning of an artist name
    "I-ARTIST",  # inside an artist name
    "B-MEDIUM",  # beginning of a medium name
    "I-MEDIUM",  # inside a medium name
]
Adapt Tokenizations
With all that said about the annotation scheme, BERT and several other models use a different tokenization model (subword pieces), so we have to align the two tokenizations.
In this case with bert-base-uncased, the expected outcome is like this:
damien B-ARTIST
hi I-ARTIST
##rst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
To do this, you can go through each token in the original annotation, tokenize it, and repeat its label for every resulting piece:
tokens_old = ['Damien', 'Hirst', 'oil', 'in', 'canvas']
labels_old = ["B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]
label2id = {label: idx for idx, label in enumerate(label_list)}

tokens, labels = zip(*[
    (token, label)
    for token_old, label in zip(tokens_old, labels_old)
    for token in tokenizer.tokenize(token_old)
])
When you add [CLS] and [SEP] to the tokens, their label "O" must be added to labels as well.
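For example, a tiny sketch of that step, assuming the tokens and labels tuples produced by the snippet above:
# Wrap the word pieces in BERT's special tokens and label them "O".
tokens = ["[CLS]", *tokens, "[SEP]"]
labels = ["O", *labels, "O"]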
With the code above, a beginning tag like B-ARTIST can get repeated when the first word of an entity is split into pieces. According to the huggingface documentation, you can encode those labels as -100 so they are ignored:
https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
Something like this should work:
tokens, labels = zip(*[
    (token, label2id[label] if (label[:2] != "B-" or i == 0) else -100)
    for token_old, label in zip(tokens_old, labels_old)
    for i, token in enumerate(tokenizer.tokenize(token_old))
])
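Going back the other way at prediction time (the question as originally asked), one simple option is to keep only the tag predicted for the first word piece of each word. A rough sketch, assuming tokens_old and the tokenizer from above, with an illustrative per-piece prediction list and no [CLS]/[SEP] for brevity:
predicted = ['B-ARTIST', 'I-ARTIST', 'I-ARTIST', 'B-MEDIUM', 'I-MEDIUM', 'I-MEDIUM']  # illustrative

word_tags = []
i = 0
for word in tokens_old:
    n_pieces = len(tokenizer.tokenize(word))
    word_tags.append(predicted[i])  # the first piece's tag stands in for the whole word
    i += n_pieces

print(list(zip(tokens_old, word_tags)))
# [('Damien', 'B-ARTIST'), ('Hirst', 'I-ARTIST'), ('oil', 'B-MEDIUM'), ('in', 'I-MEDIUM'), ('canvas', 'I-MEDIUM')]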

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have Python 3.7.3, spaCy 2.2.3, and Rasa 1.6.1.
Does someone know how to fix this issue?
That sounds like a naming mismatch: I guess you applied a matcher built on one text to a different one, so the match_id hash can't be resolved and spaCy gets confused.
To solve it, make sure you use the same matcher on the same text, like below.
Perform the standard imports, set up nlp, and import the PhraseMatcher library:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one we built for this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

spacy-lookup punctuation interference

Spacy-lookup is an entity matcher for very large dictionaries, which uses the FlashText module.
It seems that punctuation in the second case below prevents it from matching the entity.
Does someone know why this occurs and how it can be solved?
import spacy
from spacy_lookup import Entity
nlp = spacy.load("en_core_web_sm", disable = ['NER'])
entity = Entity(keywords_list=['vitamin D'])
nlp.add_pipe(entity, last=True)
#works for this sentence:
doc = nlp("vitamin D is contained in this.")
print([token.text for token in doc if token._.is_entity])
#['vitamin D']
#does not work for this sentence:
doc = nlp("This contains vitamin D.")
print([token.text for token in doc if token._.is_entity])
#[]
Edit: interestingly, this does not occur when one directly uses the flashtext library (upon which spacy-lookup is based):
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('vitamin D')
keywords_found = keyword_processor.extract_keywords("This contains vitamin D.", span_info=True)
print(keywords_found)
# [('vitamin D', 14, 23)]
Edit: as Anwarvic pointed out, the problem comes from the way the default tokenizer splits the string.
Edit: I am trying to find a general solution which does not, for example, involve adding spaces before every punctuation mark (basically, a solution which does not involve reformatting the input text).
The solution is pretty simple ... put a space after "D", like so:
>>> doc = nlp("This contains vitamin D .") #<-- space after D
>>> print([token.text for token in doc if token._.is_entity])
['vitamin D']
Why does it happen? Simply because spaCy treats "D." as a whole token, the same way the "D." in the name "D. Cooper" is treated as a whole token!
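A quick way to see this is to print the tokens spaCy produces for both sentences; a minimal check that doesn't involve spacy-lookup at all:
import spacy

nlp = spacy.load("en_core_web_sm")
for sent in ["vitamin D is contained in this.", "This contains vitamin D."]:
    print([token.text for token in nlp(sent)])
# In the second sentence "D." comes out as a single token,
# which is why the dictionary entry "vitamin D" no longer lines up.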

Is there a way to turn off specific built-in tokenization rules in Spacy?

Spacy automatically tokenizes word contractions such as "dont" and "don't" into "do" and "nt"/"n't". For instance, a sentence like "I dont understand" would be tokenized into: ["I", "do", "nt", "understand"].
I understand this is usually helpful in many NLP tasks, but is there a way to suppress this special tokenization rule in Spacy such that the result is ["I", "dont", "understand"] instead?
This is because I am trying to evaluate the performance (F1 score for the BIO tagging scheme) of my custom Spacy NER model, and the mismatch between the number of tokens in the input sentence and the number of predicted token tags is causing problems for my evaluation code down the line:
Input (3 tokens): [("I", "O"), ("dont", "O"), ("understand", "O")]
Predicted (4 tokens): [("I", "O"), ("do", "O"), ("nt", "O"), ("understand", "O")]
Of course, if anyone has any suggestions for a better way to perform evaluation on sequential tagging tasks in Spacy (perhaps like the seqeval package but more compatible with Spacy's token format), that would be greatly appreciated as well.
The special-case tokenization rules are defined in the tokenizer_exceptions.py in the respective language data (see here for the English "nt" contractions). When you create a new Tokenizer, those special case rules can be passed in via the rules argument.
Approach 1: Custom tokenizer with different special case rules
So one thing you could do for your use case is to reconstruct the English Tokenizer with the same prefix, suffix and infix rules, but with only a filtered set of tokenizer exceptions. Tokenizer exceptions are keyed by the string, so you could remove the entries for "dont" and whatever else you need. However, the code is quite verbose, since you're reconstructing the whole tokenizer:
from spacy.lang.en import English
from spacy.lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from spacy.lang.en import TOKENIZER_EXCEPTIONS
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex
prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_re = compile_infix_regex(TOKENIZER_INFIXES).finditer
filtered_exc = {key: value for key, value in TOKENIZER_EXCEPTIONS.items() if key not in ["dont"]}
nlp = English()
tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re,
    suffix_search=suffix_re,
    infix_finditer=infix_re,
    rules=filtered_exc
)
nlp.tokenizer = tokenizer
doc = nlp("I dont understand")
Approach 2: Merge (or split) tokens afterwards
An alternative approach would be to keep the tokenization as it is, but add rules on top that merge certain tokens back together afterwards to match the desired tokenization. This is obviously going to be slower at runtime, but it might be easier to implement and reason about, because you can approach it from the perspective of "Which tokens are currently separated but should be one?". For this, you could use the rule-based Matcher and the retokenizer to merge the matched tokens back together. As of spaCy v2.1, it also supports splitting, in case that's relevant.
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "do"}, {"LOWER": "nt"}]]
matcher.add("TO_MERGE", None, *patterns)
doc = nlp("I dont understand")
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        span = doc[start:end]
        retokenizer.merge(span)
The above pattern would match two tokens (one dict per token), whose lowercase forms are "do" and "nt" (e.g. "DONT", "dont", "DoNt"). You can add more lists of dicts to the patterns to describe other sequences of tokens. For each match, you can then create a Span and merge it into one token. To make this logic more elegant, you could also wrap it as a custom pipeline component, so it's applied automatically when you call nlp on a text.
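For instance, here's a rough sketch of that wrapping, reusing the spaCy v2-style API from the snippet above; the component name merge_contractions is just illustrative:
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
matcher.add("TO_MERGE", None, [{"LOWER": "do"}, {"LOWER": "nt"}])

def merge_contractions(doc):
    # Merge each matched "do" + "nt" pair back into a single token.
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matcher(doc):
            retokenizer.merge(doc[start:end])
    return doc

# Register the component so it runs every time nlp is called on a text (spaCy v2 API).
nlp.add_pipe(merge_contractions, name="merge_contractions", first=True)

doc = nlp("I dont understand")
print([t.text for t in doc])  # expected: ['I', 'dont', 'understand']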

Custom NER Spacy throwing IndexError: list index out of range

I am trying to create custom NER using Spacy but, while training, I am getting the following error:
gold = GoldParse(doc, entities=entity_offsets)
File "gold.pyx", line 565, in spacy.gold.GoldParse.init
IndexError: list index out of range
Any idea as to how I can fix this?
The most common resolution that came up after some Google searching was to trim leading and trailing whitespace in the training data, so I used the code below to trim it off. But it still was of no use.
import re

invalid_span_tokens = re.compile(r'\s')

cleaned_data = []
for text, annotations in data:
    entities = annotations['entities']
    valid_entities = []
    for start, end, label in entities:
        valid_start = start
        valid_end = end
        while valid_start < len(text) and invalid_span_tokens.match(text[valid_start]):
            valid_start += 1
        while valid_end > 1 and invalid_span_tokens.match(text[valid_end - 1]):
            valid_end -= 1
        valid_entities.append([valid_start, valid_end, label])
    cleaned_data.append([text, {'entities': valid_entities}])
Ah, so the words would be a keyword argument on the GoldParse object. This lets you specify the gold-standard tokenization, if it doesn't match spaCy's tokenization. Assuming your input looks like this:
text = 'helloworld'
words = ['hello', 'world']
tags = ['INTJ', 'NOUN']
You can do the following:
doc = nlp.make_doc(text)  # spaCy's own tokenization of the raw text
gold = GoldParse(doc, words=words, tags=tags)  # gold-standard tokens and tags
nlp.update([doc], [gold])
Alternatively, you can also use the new "simple training style" and just pass in the text as a string, and the annotations as a dictionary:
nlp.update([text], [{'words': words, 'tags': tags}])
In general, we'd recommend using the simple style, as it removes one layer of abstraction and lets you get rid of the GoldParse import. But ultimately, the style you choose depends on your personal preference.
