How to write a spaCy matcher for a POS regex - nlp

Spacy has two features I'd like to combine - part-of-speech (POS) and rule-based matching.
How can I combine them in a neat way?
For example - let's say the input is a single sentence and I'd like to verify that it meets some POS ordering condition - for example, that the verb comes after the noun (something like a noun.*verb regex). The result should be true or false. Is that doable? Or is the matcher limited to specific tokens like in the examples?
Can rule-based matching have POS rules?
If not - here is my current plan - gather all the POS tags in one string and apply a regex:
import spacy
nlp = spacy.load('en_core_web_sm')
#doc = nlp(u'is there any way you can do it')
text = u'what are the main issues'
doc = nlp(text)
concatPos = ''
print(text)
for word in doc:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
    concatPos += word.text + "_" + word.tag_ + "_" + word.pos_ + "-"
print('-----------')
print(concatPos)
print('-----------')
# output string: what_WP_NOUN-are_VBP_VERB-the_DT_DET-main_JJ_ADJ-issues_NNS_NOUN-
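With that string built, the ordering check itself would just be a regular expression over it; a minimal sketch of the plan:

import re
## the noun-then-verb condition as a regex over the concatenated tags
hasNounThenVerb = re.search(r'_NOUN-.*_VERB-', concatPos) is not None
print(hasNounThenVerb)  # True for this sentence ("what" is a NOUN, "are" a VERB)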

Sure, simply use the POS attribute.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
matcher.add("Adjective and noun", [[{'POS': 'ADJ'}, {'POS': 'NOUN'}]])
doc = nlp(u'what are the main issues')
matches = matcher(doc)
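Each match is a (match_id, start, end) tuple of token indices into the doc, so the matched spans can be printed like this:

for match_id, start, end in matches:
    print(doc[start:end])  # "main issues" for this sentence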

Eyal Shulman's answer was helpful, but it makes you hard-code a token pattern rather than use a true regular expression.
I wanted to use regular expressions, so I made my own solution:
import re
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'is there any way you can do it')
## the sentence to search (in the original this came from doc[start:end].sent)
sentence = next(doc.sents)

pattern = r'(<VERB>)*(<ADV>)*(<PART>)*(<VERB>)+(<PART>)*'
## create a string with the pos of the sentence
posString = ""
for w in sentence:
    posString += "<" + w.pos_ + ">"
lstVerb = []
for m in re.compile(pattern).finditer(posString):
    ## each m is a verb phrase match starting at character offset m.start()
    ## count the "<" in m to find how many tokens we want
    numTokensInGroup = m.group().count('<')
    ## then find the number of tokens that came before that group
    numTokensBeforeGroup = posString[:m.start()].count('<')
    verbPhrase = sentence[numTokensBeforeGroup:numTokensBeforeGroup + numTokensInGroup]
    lstVerb.append(verbPhrase)
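For example, the collected verb-phrase spans can then be printed (a usage sketch for the snippet above):

for vp in lstVerb:
    print(vp.text)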

Related

Spacy Extracting mentions with Matcher

Hi everyone, I am trying to match a sentence inside a bigger text using the spaCy rule-matcher, but the output is empty.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
matcher = Matcher(nlp.vocab)
matcher.add("mentions",[pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
The idea of the rule is to match the "CafeA is very generous with the portions." bit, but I do not get any results. What is the correct way to do this in spaCy?
Any help will be appreciated.
Your code produces 2 6 CafeA is very generous when I run it on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4) using both the en_core_web_md and en_core_web_trf pipelines. As a side note, "CafeA" is not tagged as an organisation when using en_core_web_sm, and therefore the pattern does not match.
If you want to include "with the portions", you'll need to expand the pattern to include the appropriate PoS tags (i.e. ADP (adposition), DET (determiner) and NOUN (noun) respectively). For example:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}, {"POS": "ADP"}, {"POS": "DET"}, {"POS": "NOUN"}]
for tok in doc1:
    print(tok.text, tok.pos_)
    print(tok.ent_type_)
matcher = Matcher(nlp.vocab)
matcher.add("mentions", [pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
Matches and prints the following:
2 9 CafeA is very generous with the portions
If you're getting unexpected results with the matcher, try restarting your IDE or clearing any cached/stored variables.
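As a quick sanity check, you can also print the entities the model actually found, since the first pattern token requires an ORG entity (a sketch):

print([(ent.text, ent.label_) for ent in doc1.ents])
# e.g. [('CafeA', 'ORG')] with en_core_web_md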

How to extract a PERSON named entity after a certain word with spaCy?

I have this text (text2 in the code below). It has three 'by' words, and I want to use spaCy to extract the person's name that follows one of them (the full name, even if it is three words; some cultures use long names, in this case it is two). The code is below, but my pattern raises an error. My intention: first pin down the word 'by' with ORTH, then tell the program that whatever comes next is the entity called PERSON. I would be happy if anyone could help:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'

def extract_person(nlp_doc):
    pattern = [{'ORTH': 'by'}, {'POS': 'NOUN'}]
    # second possible pattern:
    #pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
    matcher.add('person_only', [pattern])
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

target_doc = nlp(text2)
extract_person(target_doc)
I think this question can be asked the other way around: how do you use NER tags in a Matcher pattern in spaCy?
If you want to use whole names, you should merge entities first. You can do this by calling: nlp.add_pipe("merge_entities", after="ner")
Then in your pattern instead of:
pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
Use:
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
Complete code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("merge_entities", after="ner")
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'
doc = nlp(text2)
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
matcher = Matcher(nlp.vocab)
matcher.add('person_only', [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])
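If you would rather not add merge_entities to the pipeline, a hedged alternative sketch is to let the pattern absorb consecutive PERSON tokens with the + operator:

pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON", "OP": "+"}]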

How to make a spacy matcher pattern using verb tense/mood?

I've been trying to make a specific pattern for a spaCy matcher using verb tenses and moods.
I found out how to access the morphological features of words parsed with spaCy using model.vocab.morphology.tag_map[token.tag_], which prints out something like this when the verb is in the subjunctive mood (the mood I am interested in):
{'Mood_sub': True, 'Number_sing': True, 'Person_three': True, 'Tense_pres': True, 'VerbForm_fin': True, 74: 100}
however, I would like to have a pattern like this one to retokenize specific verb phrases:
pattern = [{'TAG':'Mood_sub'}, {'TAG':'VerbForm_ger'}]
In the case of a Spanish phrase like 'Que siga aprendiendo', 'siga' has 'Mood_sub' = True in its tag, and 'aprendiendo' has 'VerbForm_ger' = True in its tag. However, the matcher is not detecting this match.
Can anyone tell me why this is and how I could fix it? This is the code I am using:
import spacy
from spacy.matcher import Matcher

model = spacy.load('es_core_news_md')
matcher = Matcher(model.vocab)
text = 'Que siga aprendiendo de sus alumnos'
doc = model(text)
pattern = [{'TAG': 'Mood_sub'}, {'TAG': 'VerbForm_ger'}]
matcher.add('subjunctive_gerund', None, pattern)
matches = matcher(doc)
for i, start, end in matches:
    span = doc[start:end]
    if len(span) > 0:
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span)
The morph support isn't fully implemented in spaCy v2, so this is not possible using the direct morph values like Mood_sub.
Instead, I think the best option with the Matcher is to use REGEX over the combined/extended TAG values. It's not going to be particularly elegant, but it should work:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('es_core_news_sm')
doc = nlp("Que siga aprendiendo de sus alumnos")
assert doc[1].tag_ == "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"
matcher = Matcher(nlp.vocab)
matcher.add("MOOD_SUB", [[{"TAG": {"REGEX": ".*Mood=Sub.*"}}]])
assert matcher(doc) == [(513366231240698711, 1, 2)]
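As an aside, in spaCy v3 morphological features became a first-class token attribute, so a pattern along these lines should work directly (a sketch using the documented MORPH set operators):

matcher.add("MOOD_SUB_V3", [[{"MORPH": {"IS_SUPERSET": ["Mood=Sub"]}}]])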

POS tagging a single word in spaCy

The spaCy POS tagger is usually used on entire sentences. Is there a way to efficiently apply unigram POS tagging to a single word (or a list of single words)?
Something like this:
words = ["apple", "eat", good"]
tags = get_tags(words)
print(tags)
> ["NNP", "VB", "JJ"]
Thanks.
English unigrams are often hard to tag well, so think about why you want to do this and what you expect the output to be. (Why is the POS of apple in your example NNP? What's the POS of can?)
spacy isn't really intended for this kind of task, but if you want to use spacy, one efficient way to do it is:
import spacy

nlp = spacy.load('en_core_web_sm')
# disable everything except the tagger
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
nlp.disable_pipes(*other_pipes)
words = ["apple", "eat", "good"]
# use nlp.pipe() instead of nlp() to process multiple texts more efficiently
for doc in nlp.pipe(words):
    if len(doc) > 0:
        print(doc[0].text, doc[0].tag_)
See the documentation for nlp.pipe(): https://spacy.io/api/language#pipe
You can do something like this:
import spacy
nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]
for word in word_list:
    doc = nlp(word)
    print(doc[0].text, doc[0].pos_)
Alternatively, you can do:
import spacy

nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]
# build a Doc directly from the pre-tokenized words, then run each pipeline component on it
doc = spacy.tokens.doc.Doc(nlp.vocab, words=word_list)
for name, proc in nlp.pipeline:
    doc = proc(doc)
pos_tags = [x.pos_ for x in doc]
print(pos_tags)

How to logically segment a sentence using spacy ?

I am new to spaCy and trying to segment a sentence logically, so that I can process each part separately. E.g.:
"If the country selected is 'US', then the zip code should be numeric"
This needs to be broken into :
If the country selected is 'US',
then the zip code should be numeric
Another sentence with commas should not be broken:
The allowed states are NY, NJ and CT
Any ideas or thoughts on how to do this in spaCy?
I am not sure whether we can do this without training the model on custom data, but spaCy allows you to add rules for tokenisation, sentence segmentation, etc.
The following code (written against the spaCy v2 API) may be useful for this particular case, and you can change the rules according to your requirements.
#Importing spacy and Matcher to merge matched patterns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

#Defining pattern i.e. any text surrounded with '' should be merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]

#Adding pattern to the matcher
matcher.add('special_merger', None, pattern)

#Method to merge matched patterns
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

#To determine whether a token can be the start of a sentence.
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

#Defining a rule such that, if the previous token is "," and the token before that is "'US'",
#then the current token should be the start of a sentence.
def should_be_sentence_start(token):
    if token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'US'":
        return True
    else:
        return False

#Adding the matcher and sentence segmentation to the nlp pipeline.
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

#Applying the NLP pipeline to the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output:
If the country selected is 'US',
then the zip code should be numeric
