Ignoring filler words in part of speech pattern NLTK - python-3.x

I have a rule-based text-matching program that operates on rules built from specific POS patterns. For example, one rule is:
pattern = [('PRP', "i'll"), ('VB', ('jump', 'play', 'bite', 'destroy'))]
In this case, when analyzing my input text, this will only return results in a string that fit this specific grammatical pattern, so:
I'll jump
I'll play
I'll bite
I'll destroy
My question involves extracting the same meaning when people use the same text but add an intensifier or any type of word that doesn't change the context. Right now it only does exact matches, so it won't catch phrases like these:
I'll 'freaking' jump
'Dammit' I'll play
I'll play 'dammit'
The word doesn't have to be specific; it's just about making sure the program can still identify the same pattern with the addition of a non-contextual intensifier or any other word with the same purpose. This is the flagger I've written, along with an example string:
string_list = [('Its', 'PRP$'), ('annoying', 'NN'), ('when', 'WRB'), ('a', 'DT'), ('kid', 'NN'), ('keeps', 'VBZ'), ('asking', 'VBG'), ('you', 'PRP'), ('to', 'TO'), ('play', 'VB'), ('but', 'CC'), ("I'll", 'NNP'), ('bloody', 'VBP'), ('play', 'VBP'), ('so', 'RB'), ('it', 'PRP'), ('doesnt', 'VBZ'), ('cry', 'NN')]
def find_match_pattern(string_list, pattern_dict):
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()  # runs a sentiment analysis on the output string
    filt_ = ['Filter phrases']  # not the patterns, just phrases I know I don't want
    filt_tup = [x.lower() for x in filt_]
    for rule, pattern in pattern_dict.items():  # pattern_dict is an OrderedDict courtesy of collections
        num_matched = 0
        for idx, tuple in enumerate(string_list):  # string_list is the input string that has been POS tagged
            matched = False
            if tuple[1] == list(pattern.keys())[num_matched]:
                if tuple[0] in pattern[tuple[1]]:
                    num_matched += 1
                else:
                    num_matched = 0
            else:
                num_matched = 0
            if num_matched == len(pattern):  # if the number of matched words equals the length of the pattern, do this
                matched_string = ' '.join([i[0] for i in string_list])  # joined for the sentiment analysis score
                vs = analyzer.polarity_scores(matched_string)
                sentiment = vs['compound']
                if matched_string in filt_tup:
                    break
                elif (matched_string not in filt_tup) or (sentiment < -0.8):
                    matched = True
                    print(matched, '\n', matched_string, '\n', sentiment)
                    return (matched, sentiment, matched_string, rule)
I know it's a really abstract (or down-the-rabbit-hole) question, so it may turn into a discussion, but if anyone has experience with this it would be awesome to see what you recommend.

Your question can be answered using spaCy's dependency parser. spaCy provides a Matcher with many optional and toggleable token attributes.
In the case below, instead of matching on specific words or parts of speech, the focus is on certain syntactic functions, such as the nominal subject and the auxiliary verbs.
Here's a quick example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab, validate=True)
pattern = [{'DEP': 'nsubj', 'OP': '+'},  # OP '+' means there has to be at least one nominal subject - usually a pronoun
           {'DEP': 'aux', 'OP': '?'},    # OP '?' means it can have one or zero auxiliary verbs
           {'POS': 'ADV', 'OP': '?'},    # now it looks for an adverb; also optional (OP '?')
           {'POS': 'VERB'}]              # finally, I've generalized it with a verb, but you can make one pattern for each verb or write a loop to do it
matcher.add("NVAV", None, pattern)
phrases = ["I'll really jump.",
           "Okay, I'll play.",
           "Dammit I'll play",
           "I'll play dammit",
           "He constantly plays it",
           "She usually works there"]
for phrase in phrases:
    doc = nlp(phrase)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print('Matched:', span.text)
Matched: I'll really jump
Matched: I'll play
Matched: I'll play
Matched: I'll play
Matched: He constantly plays
Matched: She usually works
You can always test your patterns in the live example: Spacy Live Example
You can extend it as you wish. Read more here: https://spacy.io/usage/rule-based-matching
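If you do want one pattern per verb, as the last comment in the pattern above suggests, a small loop can generate them. This is only a sketch: the verb list and the rule names are made up, and registering them assumes the newer matcher.add(name, [pattern]) signature rather than the older one used above.

```python
# Build one Matcher pattern per verb (the verb list is hypothetical).
verbs = ["jump", "play", "bite", "destroy"]

patterns = {}
for verb in verbs:
    patterns[f"NVAV_{verb}"] = [
        {'DEP': 'nsubj', 'OP': '+'},   # at least one nominal subject
        {'DEP': 'aux', 'OP': '?'},     # optional auxiliary verb
        {'POS': 'ADV', 'OP': '?'},     # optional adverb (the "filler" word)
        {'LEMMA': verb},               # this specific verb
    ]

# Each entry would then be registered with, e.g.:
# matcher.add("NVAV_jump", [patterns["NVAV_jump"]])
```

Using LEMMA instead of a literal text match also catches inflections like "plays" or "played".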

Related

Wordnet: Finding the most common hypernyms

The task that I am trying to achieve is finding the top 20 most common hypernyms for all nouns and verbs in a text file. I believe that my output is erroneous and that there is a more elegant solution, particularly to avoid manually creating a list of the most common nouns and verbs and the code that iterates over the synsets to identify the hypernyms.
Please see below for the code I have attempted so far, any guidance would be appreciated:
nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]
def check_hypernym(word_list):
    return_list = []
    for word in word_list:
        w = wordnet.synsets(word)
        for syn in w:
            if not len(syn.hypernyms()) == 0:
                return_list.append(word)
                break
    return return_list
hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)
word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']
hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)
final_list = []
for syn in syn_list:
    hypernyms_syn = syn.hypernyms()
    final_list.append(hypernyms_syn)
final_list
I tried identifying the top 20 most common words and verbs, and then identified their synsets and subsequently their hypernyms. I would prefer to use a more cohesive solution, especially since I am unsure of whether my current result is accurate.
For the first part, getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but I believe you can shorten it, since a token that is a noun or verb is by definition not punctuation. You can also use in so that you don't need two separate boolean conditions for NOUN and VERB.
nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]
Other than that it looks fine.
For the second part, getting the most common hypernyms: your general approach is fine. You could make it a little more memory-efficient for long texts, where the same hypernym may appear many times, by using a Counter object from the get-go instead of constructing a long list. See the code below.
from nltk.corpus import wordnet as wn
from collections import Counter

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']
hypernym_counts = Counter()
for word in word_list:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())
top_20_hypernyms = hypernym_counts.most_common()[:20]
for i, hypernym in enumerate(top_20_hypernyms, start=1):
    hypernym, count = hypernym
    print(f"{i}. {hypernym.name()} ({count})")
Outputs:
1. time_period.n.01 (6)
2. be.v.01 (3)
3. communicate.v.02 (3)
4. male.n.02 (3)
5. think.v.03 (3)
6. male_aristocrat.n.01 (2)
7. letter.n.02 (2)
8. thyroid_hormone.n.01 (2)
9. experience.v.01 (2)
10. copulate.v.01 (2)
11. travel.v.01 (2)
12. time_unit.n.01 (2)
13. serve.n.01 (2)
14. induce.v.02 (2)
15. accept.v.03 (2)
16. make.v.02 (2)
17. leave.v.04 (2)
18. give.v.03 (2)
19. parent.n.01 (2)
20. make.v.03 (2)

spaCy Matcher Rule not finding phrase in text

I have this spaCy Matcher object I created, with a rule to match an adjective and one or two nouns. Unfortunately, the expected output of beautiful design, smart search, automatic labels and optional voice responses is not being returned, and I can't decipher what the problem is with my code.
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)
It worked for me. I used the large pipeline, 'en_core_web_lg'.
Which pipeline do you use? And how do you declare your matcher?
Here is my code:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
matcher = Matcher(nlp.vocab)
# +Your code
doc = nlp("Features of the app include a beautiful design, smart search, automatic labels and optional voice responses.")
# Write a pattern for adjective plus one or two nouns
pattern = [{'POS': 'ADJ'}, {'POS': 'NOUN'}, {'POS': 'NOUN', 'OP': '?'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add('ADJ_NOUN_PATTERN', [pattern])
matches = matcher(doc)
print('Total matches found:', len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print('Match found:', doc[start:end].text)

how to extract a PERSON named entity after certain word with spacy?

I have this text (text2 in the code); it contains the word 'by' 3 times. I want to use spaCy to extract the person's name (the full name, even if it is 3 words; some cultures use long names, in this case 2). The code is below; my pattern raises an error. My intention: first anchor the word 'by' with ORTH, then tell the program that whatever comes next is the named entity PERSON. I would be happy if anyone could help:
import spacy
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'
def extract_person(nlp_doc):
    pattern = [{'ORTH': 'by'}, {'POS': 'NOUN'}}]
    # second possible pattern:
    # pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
    matcher.add('person_only', None, pattern)
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text
target_doc = nlp(text2)
extract_person(target_doc)
I think this question can be asked the other way around: how do you use NER tags in a Matcher pattern in spaCy?
If you want to use whole names you should merge entities at the beginning. You can do it by calling: nlp.add_pipe("merge_entities", after="ner")
Then in your pattern instead of:
pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
Use:
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
Complete code:
nlp.add_pipe("merge_entities", after="ner")
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'
doc = nlp(text2)
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
matcher = Matcher(nlp.vocab)
matcher.add('person_only', [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])

How to detokenize spacy text without doc context?

I have a sequence to sequence model trained on tokens formed by spacy's tokenization. This is both encoder and decoder.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spacy to reverse tokenization done by rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
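To illustrate the reconstruction, here is a minimal pure-Python sketch; the trailing_space flags play the role of spaCy's token.whitespace_ attribute, and the example tokens and flags are made up for the demonstration.

```python
def detokenize(words, trailing_space):
    # trailing_space[i] mirrors spaCy's token.whitespace_:
    # True if token i is followed by a space in the original text.
    return "".join(w + (" " if s else "")
                   for w, s in zip(words, trailing_space))

words = ["This", "does", "n't", "work", "."]
spaces = [True, False, True, False, False]
detokenize(words, spaces)  # "This doesn't work."
```

A seq2seq model that also predicts these per-token flags can reproduce the original surface string exactly.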
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
if token.i == 0:
return False
if token.nbor(-1).whitespace_:
return True
else:
return False
def has_space(token):
return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
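Here is a rough sketch of that joining rule; this is my own illustration of the idea, not revtok's actual code. A space is emitted between two lexemes only when the left one wants a following space and the right one wants a preceding space.

```python
def join_lexemes(triples):
    # Each triple is (text, wants_preceding_space, wants_following_space).
    # A space appears only where BOTH neighbours have space affinity.
    out = []
    prev_wants_following = False
    for i, (text, wants_pre, wants_follow) in enumerate(triples):
        if i > 0 and prev_wants_following and wants_pre:
            out.append(" ")
        out.append(text)
        prev_wants_following = wants_follow
    return "".join(out)

# Both encodings of "hello." give the same string; "blaming" the
# period for the missing space means encoding it as (., 0, 1):
join_lexemes([("hello", 1, 0), (".", 0, 1)])     # "hello."
join_lexemes([("hello", 1, 1), ("world", 1, 1)]) # "hello world"
```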
TL;DR
I've written code that attempts to do it; the snippet is below.
It's another approach, with a computational complexity of O(n^2), using a function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string

class detokenizer:
    """ This class is an attempt to detokenize a spaCy tokenized sentence """
    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get a list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get the detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of the spaCy tokenized words equals
        the length of the joined and then spaCy re-tokenized words...
        In other words, we only join if the join is reversible.
        e.g.:
        for the text ["The", "man", "."]
        we would join "man" with "."
        but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
With this approach you may easily merge "do" and "nt", as well as strip the space between a dot "." and the preceding word.
This method is not perfect, as there are multiple possible combinations of sentences that lead to specific spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After searching for hours on Google, only a few answers came up, with this very Stack question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)

How to simplify the function which finds homographs?

I wrote the function which finds homographs in a text.
A homograph is a word that shares the same written form as another
word but has a different meaning.
For this I've used the POS tagger from NLTK (pos_tag).
POS-tagger processes a sequence of words, and attaches a part of
speech tag to each word.
For example:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')].
Code (edited):
def find_homographs(text):
    homographs_dict = {}
    if isinstance(text, str):
        text = word_tokenize(text)
    tagged_tokens = pos_tag(text)
    for tag1 in tagged_tokens:
        for tag2 in tagged_tokens:
            try:
                if homographs_dict[tag2] == tag1:
                    continue
            except KeyError:
                if tag1[0] == tag2[0] and tag1[1] != tag2[1]:
                    homographs_dict[tag1] = tag2
    return homographs_dict
It works, but it takes too much time because I've used two nested for loops. Please advise me on how I can simplify it and make it much faster.
It may seem counterintuitive, but you can easily collect all POS tags for each word in your text, then keep just the words that have multiple tags.
from collections import defaultdict

alltags = defaultdict(set)
for word, tag in tagged_tokens:
    alltags[word].add(tag)
homographs = dict((w, tags) for w, tags in alltags.items() if len(tags) > 1)
Note the two-variable loop; it's a lot handier than writing tag1[0] and tag1[1]. You'll have to look up defaultdict (and set) in the manual.
Your output format cannot handle words with three or more POS tags, so the dictionary homographs has words as keys and sets of POS tags as values.
And two more things I would advise: (1) convert all words to lower case to catch more "homographs"; and (2) nltk.pos_tag() expects to be called on one sentence at a time, so you'll get more correct tags if you sent_tokenize() your text and word_tokenize() and pos_tag() each sentence separately.
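Putting both of those suggestions together, here is a sketch (the function name is my own): it takes per-sentence tagged output, such as pos_tag(word_tokenize(s)) for each s in sent_tokenize(text), and lowercases words before collecting their tags.

```python
from collections import defaultdict

def collect_homographs(tagged_sentences):
    # tagged_sentences: an iterable of [(word, tag), ...] lists, one per
    # sentence, e.g. pos_tag(word_tokenize(s)) for s in sent_tokenize(text).
    alltags = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            alltags[word.lower()].add(tag)  # lowercase to catch more homographs
    # keep only words observed with more than one POS tag
    return {w: tags for w, tags in alltags.items() if len(tags) > 1}

tagged = [[("Play", "NNP"), ("it", "PRP")], [("I", "PRP"), ("play", "VBP")]]
collect_homographs(tagged)  # {'play': {'NNP', 'VBP'}}
```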
Here is a suggestion (not tested), but the main idea is to build a dictionary while parsing tagged_tokens, to identify homographs without a nested loop:
temp_dict = dict()
for tag in tagged_tokens:
    temp_dict.setdefault(tag[0], set()).add(tag[1])  # collect the distinct tags per word
temp_dict = {word: tags for word, tags in temp_dict.items() if len(tags) > 1}
print(temp_dict)
