spaCy matcher with entities spanning more than a single token - python-3.x

I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.
As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal")
["cat", "dog", "artic fox"] (note that the last entity has two words).
Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:
[{'LOWER': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
And for example, I have the following text:
There is no cat in the house and no arctic fox in the basement
I can successfully capture no cat and no arctic, but the last match is incorrect: the full match should be no arctic fox. This happens because the OP: '+' in the pattern matches a single custom-entity token instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?

A solution is to use the Doc.retokenize method to merge the individual tokens of each multi-token entity into a single token:
import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Label the custom entities with an EntityRuler
animals = ["cat", "dog", "arctic fox"]
ruler = EntityRuler(nlp)
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)

doc = nlp("There is no cat in the house and no arctic fox in the basement")

# Merge each (possibly multi-token) entity into a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
matcher.add('negated_animal', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)
The output is now:
no cat
no arctic fox
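
As a side note, the snippet above uses the spaCy 2.x API. On spaCy 3.x the ruler is added by string name, Matcher.add takes a list of patterns, and it also accepts a greedy='LONGEST' argument that directly addresses the "prioritize longer matches" question. A minimal sketch of the same approach, assuming spaCy 3.x:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "animal", "pattern": a}
                    for a in ["cat", "dog", "arctic fox"]])

doc = nlp("There is no cat in the house and no arctic fox in the basement")

# Merge each entity into one token so ENT_TYPE applies to a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "no"}, {"ENT_TYPE": "animal", "OP": "+"}]
# greedy="LONGEST" keeps only the longest of overlapping matches
matcher.add("negated_animal", [pattern], greedy="LONGEST")
for match_id, start, end in matcher(doc):
    print(doc[start:end])
spaCy also ships a built-in merge_entities pipeline component that performs the same retokenization for you; it is used in one of the answers below.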

Related

spaCy Matcher Rule not finding phrase in text

I have a spaCy matcher I created with a rule to match an adjective plus one or two nouns. Unfortunately, the expected output of beautiful design, smart search, automatic labels, and optional voice responses is not being returned, and I can't figure out what the problem is with my code.
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
print('Match found:', doc[start:end].text)
It worked for me. I used the large pipeline, 'en_core_web_lg'.
Which pipeline do you use? And how do you declare your matcher?
Here is my code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)

# +Your code
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

How to extract a PERSON named entity after a certain word with spaCy?

I have this text (text2 in the code); it contains three occurrences of the word 'by'. I want to use spaCy to extract the person's name that follows it (the full name, even if it is three words; some cultures use long names, in this case two). The code is below, and my pattern raises an error. My intention: first anchor the word 'by' with ORTH, then tell the program that whatever comes next is a PERSON entity. I would be happy if anyone could help:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'

def extract_person(nlp_doc):
    pattern = [{'ORTH': 'by'}, {'POS': 'NOUN'}}]  # note: the stray closing brace is a SyntaxError
    # second possible pattern:
    # pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
    matcher.add('person_only', None, pattern)
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

target_doc = nlp(text2)
extract_person(target_doc)
I think this question can be asked the other way around: how do you use NER tags in Matcher patterns in spaCy?
If you want to match whole names, you should merge the entities first. You can do that by calling: nlp.add_pipe("merge_entities", after="ner")
Then in your pattern instead of:
pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
Use:
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
Complete code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities", after="ner")

text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'
doc = nlp(text2)

pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
matcher = Matcher(nlp.vocab)
matcher.add('person_only', [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
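With merge_entities in the pipeline, Emily Muller becomes a single token whose ENT_TYPE is PERSON, so (assuming the model tags the name correctly) the only match printed should be by Emily Muller; by fire and by saying don't match, because those tokens are not PERSON entities.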

Is it possible with spaCy rule-based matching to match two keywords with up to a certain number of wildcards in between?

I am trying to match, for example, two keywords with up to five wildcards in between. I could add five patterns with different numbers of wildcards, but that is not a good solution. Is there an option like {"OP": "+5"}, or another solution?
Example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a really nice, green apple. One apple a day ...!")
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'is'}, {"OP": "+"}, {"ORTH": "apple"}]
matcher.add('test', None, pattern)
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in spans:
    print(span)
This gives two matches:
is a really nice, green apple and is a really nice, green apple. One apple
But I want only the first one. It should work in general, so splitting sentences etc. is not a solution.
You can do it as follows:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a really nice, green apple. One apple a day ...!")
matcher = Matcher(nlp.vocab)

# 'is', followed by up to five optional wildcard tokens, followed by 'apple'
pattern = [{'ORTH': 'is'}]
for i in range(5):
    pattern.append({"OP": "?"})
pattern.append({"ORTH": "apple"})

matcher.add('test', None, pattern)
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in spans:
    print(span)
# is a really nice, green apple
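
For completeness: if I remember correctly, spaCy v3.5 added bounded quantifiers to Matcher patterns, so the loop above collapses into a single token spec. A sketch, assuming spaCy >= 3.5:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a really nice, green apple. One apple a day ...!")

matcher = Matcher(nlp.vocab)
# "{,5}" means: match zero to five arbitrary tokens between 'is' and 'apple'
pattern = [{'ORTH': 'is'}, {'OP': '{,5}'}, {'ORTH': 'apple'}]
matcher.add('test', [pattern])
for match_id, start, end in matcher(doc):
    print(doc[start:end])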

spaCy - Tokenize quoted string

I am using spaCy 2.0 with a quoted string as input.
Example string
"The quoted text 'AA XX' should be tokenized"
and expecting to extract
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting: the noun chunks and entities lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to achieve what I need?
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphanumeric characters:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
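For reference, here is what that class-based refactor might look like, assuming spaCy >= 2.1 (for the retokenize context manager, which replaces the deprecated Span.merge); the QuoteMerger name is mine:
import spacy
from spacy.matcher import Matcher

class QuoteMerger:
    """Pipeline component that merges 'single-quoted' spans into one token."""
    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None,
                         [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

    def __call__(self, doc):
        # collect the spans first, then merge, so token indices stay valid
        spans = [doc[start:end] for match_id, start, end in self.matcher(doc)]
        with doc.retokenize() as retokenizer:
            for span in spans:
                retokenizer.merge(span)
        return doc

nlp = spacy.load('en')
nlp.add_pipe(QuoteMerger(nlp), first=True)
print([t.text for t in nlp("The quoted text 'AA XX' should be tokenized")])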

How to logically segment a sentence using spaCy?

I am new to spaCy and trying to segment a sentence logically so that I can process each part separately, e.g.:
"If the country selected is 'US', then the zip code should be numeric"
This needs to be broken into :
If the country selected is 'US',
then the zip code should be numeric
Another sentence with commas should not be broken:
The allowed states are NY, NJ and CT
Any ideas or thoughts on how to do this in spaCy?
I am not sure whether this can be done without training the model on custom data, but spaCy does allow you to add rules for tokenization, sentence segmentation, etc.
The following code may be useful for this particular case, and you can change the rules according to your requirements.
# Importing spaCy and the Matcher to merge matched patterns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

# Defining a pattern: any text surrounded by '' should be merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]

# Adding the pattern to the matcher
matcher.add('special_merger', None, pattern)

# Component to merge matched patterns
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# To determine whether a token can be the start of a sentence
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Rule: if the previous token is "," and the token before that is "'US'",
# then the current token should start a new sentence.
def should_be_sentence_start(token):
    if token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'US'":
        return True
    else:
        return False

# Adding the merger and sentence-boundary components to the nlp pipeline
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Applying the pipeline to the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output:
If the country selected is 'US',
then the zip code should be numeric
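
As an aside, on spaCy 3.x custom pipeline functions must be registered with @Language.component before they can be added by name. Without the merging step, the closing quote and the comma are separate tokens, so the boundary rule has to look for ' followed by , instead of "'US'". A rough sketch under those assumptions (the component name clause_boundaries is mine):
import spacy
from spacy.language import Language

@Language.component("clause_boundaries")
def clause_boundaries(doc):
    for token in doc:
        # start a new "sentence" right after a quoted value plus comma, e.g. 'US',
        if token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'":
            token.is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("clause_boundaries", before="parser")
doc = nlp("If the country selected is 'US', then the zip code should be numeric")
for sent in doc.sents:
    print(sent)
The commas in "The allowed states are NY, NJ and CT" are not preceded by a closing quote, so that sentence would stay intact.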
