I have a spaCy Matcher object created with a rule to match an adjective followed by one or two nouns. Unfortunately, the expected matches (beautiful design, smart search, automatic labels, optional voice responses) are not being returned, and I can't figure out what the problem is with my code.
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
It worked for me. I used the large pipeline, 'en_core_web_lg'.
Which pipeline are you using, and how do you declare your matcher?
Here is my code:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
# +Your code
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
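If you still get no matches with your setup, printing each token's predicted part-of-speech tag is a quick way to check whether your pipeline tags the tokens the way the pattern expects (a hypothetical debugging step, not part of the original exercise):
# Inspect the predicted POS tags; if a smaller pipeline tags a token
# as something other than ADJ or NOUN, the pattern will not match.
for token in doc:
    print(token.text, token.pos_)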
Related
I've been trying to make a specific pattern for a spaCy matcher using verb tenses and moods.
I found out how to access the morphological features of words parsed with spaCy using model.vocab.morphology.tag_map[token.tag_], which prints out something like this when the verb is in the subjunctive mood (the mood I am interested in):
{'Mood_sub': True, 'Number_sing': True, 'Person_three': True, 'Tense_pres': True, 'VerbForm_fin': True, 74: 100}
However, I would like to have a pattern like this one to retokenize specific verb phrases:
pattern = [{'TAG':'Mood_sub'}, {'TAG':'VerbForm_ger'}]
In the case of a Spanish phrase like 'Que siga aprendiendo', 'siga' has 'Mood_sub' = True in its tag, and 'aprendiendo' has 'VerbForm_ger' = True in its tag. However, the matcher is not detecting this match.
Can anyone tell me why this is and how I could fix it? This is the code I am using:
import spacy
from spacy.matcher import Matcher

model = spacy.load('es_core_news_md')
matcher = Matcher(model.vocab)
text = 'Que siga aprendiendo de sus alumnos'
doc = model(text)
pattern = [{'TAG': 'Mood_sub'}, {'TAG': 'VerbForm_ger'}]
matcher.add(1, None, pattern)
matches = matcher(doc)
for i, start, end in matches:
    span = doc[start:end]
    if len(span) > 0:
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span)
Morph support isn't fully implemented in spaCy v2, so this is not possible using direct morph values like Mood_sub.
Instead, I think the best option with the Matcher is to use REGEX over the combined/extended TAG values. It's not going to be particularly elegant, but it should work:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('es_core_news_sm')
doc = nlp("Que siga aprendiendo de sus alumnos")
assert doc[1].tag_ == "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"
matcher = Matcher(nlp.vocab)
matcher.add("MOOD_SUB", [[{"TAG": {"REGEX": ".*Mood=Sub.*"}}]])
assert matcher(doc) == [(513366231240698711, 1, 2)]
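If upgrading is an option: spaCy v3 added proper morphology support to the Matcher, so the REGEX workaround is no longer needed there. A minimal sketch, assuming a v3 pipeline (the MORPH attribute and the IS_SUPERSET set operator are v3-only):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('es_core_news_sm')
doc = nlp("Que siga aprendiendo de sus alumnos")
matcher = Matcher(nlp.vocab)
# IS_SUPERSET matches tokens whose morphological features include Mood=Sub
matcher.add("MOOD_SUB", [[{"MORPH": {"IS_SUPERSET": ["Mood=Sub"]}}]])
print([doc[start:end].text for _, start, end in matcher(doc)])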
I am trying to match, for example, two keywords with up to five wildcards in between. I could add five patterns with different numbers of wildcards, but this is not a good solution. Is there an option like {"OP": "+5"}, or another solution?
Example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a really nice, green apple. One apple a day ...!")
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'is'}, {"OP": "+"}, {"ORTH": "apple"} ]
matcher.add('test', None, pattern)
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in spans:
    print(span)
This gives two matches:
is a really nice, green apple
is a really nice, green apple. One apple
But I want only the first one. It should work in general, so splitting into sentences etc. is not a solution.
You can do it as follows:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a really nice, green apple. One apple a day ...!")
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': 'is'}]
for i in range(0, 5):
    pattern.append({"OP": "?"})
pattern.append({"ORTH": "apple"})
matcher.add('test', None, pattern)
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in spans:
    print(span)
# [is a really nice, green apple]
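As a side note, newer spaCy versions (v3.5+, if I remember correctly) support curly-brace range quantifiers directly, which would replace the loop above. A sketch under that assumption:
# spaCy v3.5+ only: '{,5}' means "repeat this wildcard token at most 5 times"
pattern = [{'ORTH': 'is'}, {'OP': '{,5}'}, {'ORTH': 'apple'}]
matcher.add('test_range', [pattern])  # the v3 API takes a list of patterns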
I have a rule-based text-matching program I've written that operates on rules created using specific POS patterns. So, for example, one rule is:
pattern = [('PRP', "i'll"), ('VB', ('jump', 'play', 'bite', 'destroy'))]
In this case, when analyzing my input text, this will only return results that grammatically fit this specific pattern, so:
I'll jump
I'll play
I'll bite
I'll destroy
My question involves extracting the same meaning when people use the same text but add a superlative or any type of word that doesn't change the context. Right now it only does exact matches, but it won't catch phrases like these:
I'll 'freaking' jump
'Dammit' I'll play
I'll play 'dammit'
The word doesn't have to be specific; it's just about making sure the program can still identify the same pattern with the addition of a non-contextual superlative or any other type of word with the same purpose. This is the flagger I've written, and I've given an example string:
string_list = [('Its', 'PRP$'), ('annoying', 'NN'), ('when', 'WRB'), ('a', 'DT'), ('kid', 'NN'), ('keeps', 'VBZ'), ('asking', 'VBG'), ('you', 'PRP'), ('to', 'TO'), ('play', 'VB'), ('but', 'CC'), ("I'll", 'NNP'), ('bloody', 'VBP'), ('play', 'VBP'), ('so', 'RB'), ('it', 'PRP'), ('doesnt', 'VBZ'), ('cry', 'NN')]
def find_match_pattern(string_list, pattern_dict):
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()  # does a sentiment analysis on the output string
    filt_ = ['Filter phrases']  # not the patterns, just phrases I know I don't want
    filt_tup = [x.lower() for x in filt_]
    for rule, pattern in pattern_dict.items():  # pattern_dict is an OrderedDict, courtesy of collections
        num_matched = 0
        for idx, tuple in enumerate(string_list):  # string_list is the input string after POS tagging
            matched = False
            if tuple[1] == list(pattern.keys())[num_matched]:
                if tuple[0] in pattern[tuple[1]]:
                    num_matched += 1
                else:
                    num_matched = 0
            else:
                num_matched = 0
            if num_matched == len(pattern):  # if the number of matched words equals the length of the pattern, do this
                matched_string = ' '.join([i[0] for i in string_list])  # joined for the sentiment analysis score
                vs = analyzer.polarity_scores(matched_string)
                sentiment = vs['compound']
                if matched_string in filt_tup:
                    break
                elif (matched_string not in filt_tup) or (sentiment < -0.8):
                    matched = True
                    print(matched, '\n', matched_string, '\n', sentiment)
                    return (matched, sentiment, matched_string, rule)
I know it's a really abstract (or down-the-rabbit-hole) question, so it may be more of a discussion, but if anyone has experience with this it would be awesome to see what you recommend.
Your question can be answered using spaCy's dependency tagger. spaCy provides a matcher with many optional and switchable options.
In the case below, instead of relying on specific words or parts of speech, the focus is on certain syntactic functions, such as the nominal subject and the auxiliary verbs.
Here's a quick example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab, validate=True)
pattern = [{'DEP': 'nsubj', 'OP': '+'},  # OP '+' means at least one nominal subject - usually a pronoun
           {'DEP': 'aux', 'OP': '?'},    # OP '?' means zero or one auxiliary verb
           {'POS': 'ADV', 'OP': '?'},    # then an optional adverb, also not required (OP '?')
           {'POS': 'VERB'}]              # finally, I've generalized it with a verb, but you can make one pattern for each verb or write a loop to do it
matcher.add("NVAV", None, pattern)
phrases = ["I\'ll really jump.",
"Okay, I\'ll play.",
"Dammit I\'ll play",
"I\'ll play dammit",
"He constantly plays it",
"She usually works there"]
for phrase in phrases:
    doc = nlp(phrase)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print('Matched:', span.text)
Matched: I'll really jump
Matched: I'll play
Matched: I'll play
Matched: I'll play
Matched: He constantly plays
Matched: She usually works
You can always test your patterns in the live example: Spacy Live Example
You can extend it as you wish. Read more here: https://spacy.io/usage/rule-based-matching
I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.
As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal")
["cat", "dog", "artic fox"] (note that the last entity has two words).
Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:
[{'LOWER': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
And for example, I have the following text:
There is no cat in the house and no artic fox in the basement
I can successfully capture no cat and no artic, but the last match is incorrect as the full match should be no artic fox. This is due to the OP: '+' in the pattern that matches a single custom entity instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?
A solution is to use the doc retokenize method in order to merge the individual tokens of each multi-token entity into a single token:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm')
animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)
doc = nlp("There is no cat in the house and no artic fox in the basement")
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
matcher.add('negated animal', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)
the output is now:
no cat
no artic fox
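If you'd rather not retokenize, spacy.util.filter_spans offers an alternative for keeping only the longest match at each position; a minimal sketch reusing the matcher and doc from above:
from spacy.util import filter_spans

# filter_spans keeps the longest span when matches overlap, so
# 'no artic fox' wins over 'no artic' without merging tokens first
spans = [doc[start:end] for match_id, start, end in matcher(doc)]
for span in filter_spans(spans):
    print(span)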
I am using spaCy 2.0 with a quoted string as input.
Example string
"The quoted text 'AA XX' should be tokenized"
and expecting to extract
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting: noun chunks and ents lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to address what I need?
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphabetic tokens between the quotes:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc
nlp.add_pipe(quote_merger, first=True) # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
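For illustration, a minimal sketch of such a class (the name QuoteMerger is just a placeholder), using the same v2-style API as above:
class QuoteMerger(object):
    # reusable pipeline component: the matcher is set up once in __init__
    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

    def __call__(self, doc):
        # collect all matched spans first, then merge them into single tokens
        spans = [doc[start:end] for match_id, start, end in self.matcher(doc)]
        for span in spans:
            span.merge()
        return doc

nlp.add_pipe(QuoteMerger(nlp), first=True)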
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.