Spacy Extracting mentions with Matcher - python-3.x

Hi everyone, I am trying to match a sentence inside a bigger sentence using the spaCy rule-based Matcher, but the output is empty.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
matcher = Matcher(nlp.vocab)
matcher.add("mentions",[pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
The idea of the rule is to match the "CafeA is very generous with the portions." bit, but I do not get any result. What is the correct way to do this in spaCy?
Any help will be appreciated.

Your code produces 2 6 CafeA is very generous when I run it on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4) using both the en_core_web_md and en_core_web_trf pipelines. As a side note, "CafeA" is not tagged as an organisation when using en_core_web_sm, and therefore the pattern does not match.
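If you want to check quickly which label (if any) a given pipeline assigns to "CafeA" before debugging the pattern itself, a small sketch like this prints the document's entities (swap in whichever model you actually have installed):
import spacy
# Quick check of the entity labels a pipeline produces for this text
nlp = spacy.load("en_core_web_md")
doc = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
print([(ent.text, ent.label_) for ent in doc.ents])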
If you want to include "with the portions", you'll need to expand the pattern to include the appropriate POS tags (i.e. ADP (adposition), DET (determiner) and NOUN (noun) respectively). For example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}, {"POS": "ADP"}, {"POS": "DET"}, {"POS": "NOUN"}]
for tok in doc1:
    print(tok.text, tok.pos_)
    print(tok.ent_type_)
matcher = Matcher(nlp.vocab)
matcher.add("mentions", [pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
This matches and prints the following:
2 9 CafeA is very generous with the portions
If you're getting unexpected results with the matcher, try restarting your IDE or clearing any cached/stored variables.

Related

How to make a spacy matcher pattern using verb tense/mood?

I've been trying to make a specific pattern for a spaCy matcher using verb tenses and moods.
I found out how to access the morphological features of words parsed with spaCy using model.vocab.morphology.tag_map[token.tag_], which prints out something like this when the verb is in the subjunctive mood (the mood I am interested in):
{'Mood_sub': True, 'Number_sing': True, 'Person_three': True, 'Tense_pres': True, 'VerbForm_fin': True, 74: 100}
However, I would like to have a pattern like this one to retokenize specific verb phrases:
pattern = [{'TAG':'Mood_sub'}, {'TAG':'VerbForm_ger'}]
In the case of a Spanish phrase like 'Que siga aprendiendo', 'siga' has 'Mood_sub' = True in its tag and 'aprendiendo' has 'VerbForm_ger' = True in its tag. However, the matcher is not detecting this match.
Can anyone tell me why this is and how I could fix it? This is the code I am using:
import spacy
from spacy.matcher import Matcher
model = spacy.load('es_core_news_md')
text = 'Que siga aprendiendo de sus alumnos'
doc = model(text)
matcher = Matcher(model.vocab)
pattern = [{'TAG': 'Mood_sub'}, {'TAG': 'VerbForm_ger'}]
matcher.add(1, None, pattern)
matches = matcher(doc)
for i, start, end in matches:
    span = doc[start:end]
    if len(span) > 0:
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span)
The morph support isn't fully implemented in spaCy v2, so this is not possible using the direct morph values like Mood_sub.
Instead, I think the best option with the Matcher is to use REGEX over the combined/extended TAG values. It's not going to be particularly elegant, but it should work:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('es_core_news_sm')
doc = nlp("Que siga aprendiendo de sus alumnos")
assert doc[1].tag_ == "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"
matcher = Matcher(nlp.vocab)
matcher.add("MOOD_SUB", [[{"TAG": {"REGEX": ".*Mood=Sub.*"}}]])
assert matcher(doc) == [(513366231240698711, 1, 2)]
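As a side note, if you are able to upgrade to spaCy v3, the Matcher can match on morphological features directly via the MORPH attribute. A minimal sketch, assuming a v3 es_core_news_sm pipeline:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("es_core_news_sm")  # assumes a spaCy v3 pipeline
doc = nlp("Que siga aprendiendo de sus alumnos")
matcher = Matcher(nlp.vocab)
# IS_SUPERSET matches tokens whose morph features include Mood=Sub
matcher.add("MOOD_SUB", [[{"MORPH": {"IS_SUPERSET": ["Mood=Sub"]}}]])
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)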

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have Python 3.7.3, spaCy 2.2.3 and Rasa 1.6.1.
Does anyone know how to fix this issue?
That sounds like a naming mistake. I guess you applied a matcher built for one text to a different one, so the match_id no longer resolves and spaCy gets confused.
To solve it, make sure you use the same matcher on the same text it was built for, like below.
First, perform the standard imports, create nlp and import the PhraseMatcher:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one that we built on this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)
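To see how the E018 error can arise, here is a hedged sketch (not your Rasa setup) that builds the matcher on one pipeline's vocab and then looks the match_id up in a different StringStore; the second store never interned the key, so the lookup fails:
import spacy
from spacy.matcher import PhraseMatcher
nlp_a = spacy.load('en_core_web_sm')
nlp_b = spacy.load('en_core_web_sm')  # a separate Vocab/StringStore
matcher = PhraseMatcher(nlp_a.vocab)
matcher.add('VoodooEconomics', None, nlp_a('voodoo economics'))  # spaCy v2 signature, as above
doc = nlp_a('He kept defending voodoo economics.')
for match_id, start, end in matcher(doc):
    print(nlp_a.vocab.strings[match_id])  # works: same StringStore
    print(nlp_b.vocab.strings[match_id])  # KeyError: [E018] ...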

Lemmatization of words using spacy and nltk not giving correct lemma

I want to get the lemmatized forms of the words in the list given below, e.g.:
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
I got the lemmas like:
Funnier
Funniest
mighty
tight
When I use NLTK's WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I got:
Funnier : Funnier
Funniest : Funniest
mightiest : mightiest
tighter : tighter
Can anyone help with this?
Thanks.
Lemmatisation depends entirely on the part-of-speech tag that you pass when getting the lemma of a particular word.
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
# Lemmatize the list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.
Notice it didn't do a good job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as expected. This can be corrected by providing the correct part-of-speech (POS) tag as the second argument to lemmatize().
Sometimes the same word can have multiple lemmas depending on the meaning/context.
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
For the example above, specify the corresponding POS tag:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
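More generally, you can automate the choice of POS tag by running nltk.pos_tag first and mapping its Penn Treebank tags to WordNet constants. A minimal sketch (the penn_to_wordnet helper below is just illustrative, not part of NLTK, and the tagger data must be downloaded first):
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
# requires: nltk.download('averaged_perceptron_tagger') and nltk.download('wordnet')
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag (from nltk.pos_tag) to a WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN
lemmatizer = WordNetLemmatizer()
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for word, tag in nltk.pos_tag([w.lower() for w in words]):
    print(word, '-->', lemmatizer.lemmatize(word, penn_to_wordnet(tag)))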

POS tagging a single word in spaCy

The spaCy POS tagger is usually used on entire sentences. Is there a way to efficiently apply unigram POS tagging to a single word (or a list of single words)?
Something like this:
words = ["apple", "eat", good"]
tags = get_tags(words)
print(tags)
> ["NNP", "VB", "JJ"]
Thanks.
English unigrams are often hard to tag well, so think about why you want to do this and what you expect the output to be. (Why is the POS of apple in your example NNP? What's the POS of can?)
spaCy isn't really intended for this kind of task, but if you want to use spaCy, one efficient way to do it is:
import spacy
nlp = spacy.load('en')
# disable everything except the tagger
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
nlp.disable_pipes(*other_pipes)
# use nlp.pipe() instead of nlp() to process multiple texts more efficiently
for doc in nlp.pipe(words):
    if len(doc) > 0:
        print(doc[0].text, doc[0].tag_)
See the documentation for nlp.pipe(): https://spacy.io/api/language#pipe
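If you are on spaCy v3, an equivalent approach (a sketch, assuming the en_core_web_sm pipeline) is to disable the components you don't need when loading, keeping only what the tagger requires:
import spacy
# Assumes a spaCy v3 pipeline: keep tok2vec/tagger/attribute_ruler, drop the rest
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "lemmatizer"])
words = ["apple", "eat", "good"]
for doc in nlp.pipe(words):
    if len(doc) > 0:
        print(doc[0].text, doc[0].tag_, doc[0].pos_)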
You can do something like this:
import spacy
nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]
for word in word_list:
    doc = nlp(word)
    print(doc[0].text, doc[0].pos_)
Alternatively, you can do:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = spacy.tokens.doc.Doc(nlp.vocab, words=word_list)
for name, proc in nlp.pipeline:
    doc = proc(doc)
pos_tags = [x.pos_ for x in doc]

Spacy - Tokenize quoted string

I am using spaCy 2.0 and passing a quoted string as input.
Example string
"The quoted text 'AA XX' should be tokenized"
and expecting to extract
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting: noun chunks and ents lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to address what I need?
While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphabetic tokens:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc
nlp.add_pipe(quote_merger, first=True) # add it right after the tokenizer
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
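A minimal sketch of such a class, using the same spaCy v2-era API as the function above (the class name is just illustrative):
import spacy
from spacy.matcher import Matcher
class QuoteMerger(object):
    # Pipeline component that merges 'single-quoted' spans into one token
    def __init__(self, nlp):
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None,
                         [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])
    def __call__(self, doc):
        spans = [doc[start:end] for _, start, end in self.matcher(doc)]
        for span in spans:
            span.merge()
        return doc
nlp = spacy.load('en')
nlp.add_pipe(QuoteMerger(nlp), first=True)
doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])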
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.
