The task I am trying to achieve is finding the top 20 most common hypernyms for all nouns and verbs in a text file. I believe my output is erroneous, and I suspect there is a more elegant solution, particularly one that avoids manually creating a list of the most common nouns and verbs and hand-writing the code that iterates over the synsets to identify the hypernyms.
Please see below for the code I have attempted so far; any guidance would be appreciated:
nouns_verbs = [token.text for token in hamlet_spacy if (not token.is_stop and not token.is_punct and token.pos_ == "VERB" or token.pos_ == "NOUN")]
def check_hypernym(word_list):
    return_list = []
    for word in word_list:
        w = wordnet.synsets(word)
        for syn in w:
            if not (len(syn.hypernyms()) == 0):
                return_list.append(word)
                break
    return return_list
hypernyms = check_hypernym(nouns_verbs)
fd = nltk.FreqDist(hypernyms)
top_20 = fd.most_common(20)
word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']
hypernym_list = []
for word in word_list:
    syn_list = wordnet.synsets(word)
    hypernym_list.append(syn_list)

final_list = []
for syn in syn_list:
    hypernyms_syn = syn.hypernyms()
    final_list.append(hypernyms_syn)
final_list
I tried identifying the top 20 most common nouns and verbs, then identified their synsets and subsequently their hypernyms. I would prefer a more cohesive solution, especially since I am unsure whether my current result is accurate.
For the first part of getting all nouns and verbs from the text: you didn't provide the original text, so I wasn't able to reproduce this, but I believe you can shorten it, since a token that is a noun or verb is by definition not punctuation. You can also use in so that you don't need two separate boolean conditions for NOUN and VERB.
nouns_verbs = [token.text for token in hamlet_spacy if not token.is_stop and token.pos_ in ["VERB", "NOUN"]]
Other than that it looks fine.
For the second part of getting the most common hypernyms, your general approach is fine. You could make it a little more memory-efficient for long texts, where the same hypernym potentially appears many times, by using a Counter object from the get-go instead of constructing a long list. See the code below.
from nltk.corpus import wordnet as wn
from collections import Counter

word_list = ['lord', 't', 'know', 'come', 'love', 's', 'sir', 'thou', 'speak', 'let', 'man', 'father', 'think', 'time', 'Let', 'tell', 'night', 'death', 'soul', 'mother']

hypernym_counts = Counter()
for word in word_list:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

top_20_hypernyms = hypernym_counts.most_common()[:20]
for i, (hypernym, count) in enumerate(top_20_hypernyms, start=1):
    print(f"{i}. {hypernym.name()} ({count})")
Outputs:
1. time_period.n.01 (6)
2. be.v.01 (3)
3. communicate.v.02 (3)
4. male.n.02 (3)
5. think.v.03 (3)
6. male_aristocrat.n.01 (2)
7. letter.n.02 (2)
8. thyroid_hormone.n.01 (2)
9. experience.v.01 (2)
10. copulate.v.01 (2)
11. travel.v.01 (2)
12. time_unit.n.01 (2)
13. serve.n.01 (2)
14. induce.v.02 (2)
15. accept.v.03 (2)
16. make.v.02 (2)
17. leave.v.04 (2)
18. give.v.03 (2)
19. parent.n.01 (2)
20. make.v.03 (2)
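To address the "more cohesive solution" you asked for, the two parts can be chained directly, with no hand-written word_list in between. Below is a minimal sketch of the combined pipeline; it assumes hamlet_spacy is the spaCy Doc from your question and that NLTK's WordNet data is available:

from collections import Counter
from nltk.corpus import wordnet as wn

# nouns and verbs taken straight from the spaCy Doc (hamlet_spacy is assumed to exist)
nouns_verbs = [token.text for token in hamlet_spacy
               if not token.is_stop and token.pos_ in ("NOUN", "VERB")]

# count every hypernym of every synset of every noun/verb in one pass
hypernym_counts = Counter()
for word in nouns_verbs:
    for synset in wn.synsets(word):
        hypernym_counts.update(synset.hypernyms())

for i, (hypernym, count) in enumerate(hypernym_counts.most_common(20), start=1):
    print(f"{i}. {hypernym.name()} ({count})")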
I have a rule-based text matching program that I've written, which operates on rules created using specific POS patterns. So, for example, one rule is:
pattern = [('PRP', "i'll"), ('VB', ('jump', 'play', 'bite', 'destroy'))]
In this case, when analyzing my input text, this will only return results in a string that fit this specific grammatical pattern, so:
I'll jump
I'll play
I'll bite
I'll destroy
My question involves extracting the same meaning when people use the same text but add a superlative or any other type of word that doesn't change the context. Right now it only does exact matches, but it won't catch phrases like the following:
I'll 'freaking' jump
'Dammit' I'll play
I'll play 'dammit'
The word doesn't have to be specific; it's just about making sure the program can still identify the same pattern with the addition of a non-contextual superlative or any other type of word with the same purpose. This is the flagger I've written, with an example string:
string_list = [('Its', 'PRP$'), ('annoying', 'NN'), ('when', 'WRB'), ('a', 'DT'), ('kid', 'NN'), ('keeps', 'VBZ'), ('asking', 'VBG'), ('you', 'PRP'), ('to', 'TO'), ('play', 'VB'), ('but', 'CC'), ("I'll", 'NNP'), ('bloody', 'VBP'), ('play', 'VBP'), ('so', 'RB'), ('it', 'PRP'), ('doesnt', 'VBZ'), ('cry', 'NN')]
def find_match_pattern(string_list, pattern_dict):
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    analyzer = SentimentIntensityAnalyzer()  # does a sentiment analysis on the output string

    filt_ = ['Filter phrases']  # not the patterns, just phrases I know I don't want
    filt_tup = [x.lower() for x in filt_]

    for rule, pattern in pattern_dict.items():  # pattern_dict is an OrderedDict, courtesy of collections
        num_matched = 0
        for idx, tuple in enumerate(string_list):  # string_list is the input string that has been POS tagged
            matched = False
            if tuple[1] == list(pattern.keys())[num_matched]:
                if tuple[0] in pattern[tuple[1]]:
                    num_matched += 1
                else:
                    num_matched = 0
            else:
                num_matched = 0
            if num_matched == len(pattern):  # if the number of matching words equals the length of the pattern do this
                matched_string = ' '.join([i[0] for i in string_list])  # joined for the sentiment analysis score
                vs = analyzer.polarity_scores(matched_string)
                sentiment = vs['compound']
                if matched_string in filt_tup:
                    break
                elif (matched_string not in filt_tup) or (sentiment < -0.8):
                    matched = True
                    print(matched, '\n', matched_string, '\n', sentiment)
                    return (matched, sentiment, matched_string, rule)
I know it's a really abstract (or down-the-rabbit-hole) question, so it may end up being more of a discussion, but if anyone has experience with this it would be awesome to see what you recommend.
Your question can be answered using spaCy's dependency tagger. spaCy provides a Matcher with many optional and switchable options.
In the case below, instead of relying on specific words or parts of speech, the focus is on certain syntactic functions, such as the nominal subject and the auxiliary verbs.
Here's a quick example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab, validate=True)

pattern = [{'DEP': 'nsubj', 'OP': '+'},  # OP + means there has to be at least one nominal subject - usually a pronoun
           {'DEP': 'aux', 'OP': '?'},    # OP ? means it can have one or zero auxiliary verbs
           {'POS': 'ADV', 'OP': '?'},    # now it looks for an adverb; it is also optional (OP ?)
           {'POS': 'VERB'}]              # finally, I've generalized it with a verb, but you can make one pattern for each verb or write a loop to do it

matcher.add("NVAV", None, pattern)

phrases = ["I'll really jump.",
           "Okay, I'll play.",
           "Dammit I'll play",
           "I'll play dammit",
           "He constantly plays it",
           "She usually works there"]

for phrase in phrases:
    doc = nlp(phrase)
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        print('Matched:', span.text)
Matched: I'll really jump
Matched: I'll play
Matched: I'll play
Matched: I'll play
Matched: He constantly plays
Matched: She usually works
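For the "one pattern per verb" idea mentioned in the comments above, a small loop could register a separate pattern for each verb from your original rule. This is just a sketch; it keeps the same spaCy 2.x matcher.add signature used above and assumes the LOWER attribute is appropriate for matching those exact verbs:

for verb in ('jump', 'play', 'bite', 'destroy'):
    verb_pattern = [{'DEP': 'nsubj', 'OP': '+'},
                    {'DEP': 'aux', 'OP': '?'},
                    {'POS': 'ADV', 'OP': '?'},
                    {'LOWER': verb}]
    matcher.add("NVAV_" + verb, None, verb_pattern)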
You can always test your patterns in the live example: Spacy Live Example
You can extend it as you wish. Read more here: https://spacy.io/usage/rule-based-matching
I am trying to look for keywords in sentences, which are stored as a list of lists. The outer list contains sentences and the inner lists contain the words of each sentence. I want to iterate over each word in each sentence, look for the keywords I have defined, and return the values where they are found.
This is what my token_sentences looks like.
I took help from this post: How to iterate through a list of lists in Python? However, I am getting an empty list in return.
This is the code I have written.
import nltk
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
text = "MDCT SCAN OF THE CHEST: HISTORY: Follow-up LUL nodule. TECHNIQUES: Non-enhanced and contrast-enhanced MDCT scans were performed with a slice thickness of 2 mm. COMPARISON: Chest CT dated on 01/05/2018, 05/02/207, 28/09/2016, 25/02/2016, and 21/11/2015. FINDINGS: Lung parenchyma: There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015). Also further increased size of two ground-glass nodules at apicoposterior segment of the LUL (SE 3; IM 37), and superior segment of the LLL (SE 3; IM 58), now measuring about 1 cm (previously size 0.4 cm in 2015), and 1.1 cm (previously size 0.7 cm in 2015) in greatest transaxial dimension, respectively."
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in nltk.sent_tokenize(text)]
nodule_keywords = ["nodules","nodule"]
count_nodule = []

def GetNodule(sentence, keyword_list):
    s1 = sentence.split(' ')
    return [i for i in s1 if i in keyword_list]

for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
    count_nodule.append(result_calcified_nod)
However, I am getting empty lists as the result in count_nodule.
This is the value of the first two rows of token_sentences:
token_sentences = [['MDCT', 'SCAN', 'OF', 'THE', 'CHEST', ':', 'HISTORY', ':', 'Follow-up', 'LUL', 'nodule', '.'],['TECHNIQUES', ':', 'Non-enhanced', 'and', 'contrast-enhanced', 'MDCT', 'scans', 'were', 'performed', 'with', 'a', 'slice', 'thickness', 'of', '2', 'mm', '.']]
Please help me figure out where I am going wrong!
You need to remove s1 = sentence.split(' ') from GetNodule because sentence has already been tokenized (it is already a List).
Remove the [0] from GetNodule(sub_list[0], nodule_keywords). Not sure why you would want to pass the first word of each sentence into GetNodule!
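Putting both fixes together, the corrected version might look like this (a sketch reusing the names from your question):

def GetNodule(sentence, keyword_list):
    # sentence is already a list of tokens, so no split() is needed
    return [i for i in sentence if i in keyword_list]

count_nodule = []
for sub_list in tokens_sentences:
    # pass the whole tokenized sentence, not just its first word
    count_nodule.append(GetNodule(sub_list, nodule_keywords))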
The error is here:
for sub_list in tokens_sentences:
    result_calcified_nod = GetNodule(sub_list[0], nodule_keywords)
You are looping over each sub_list in tokens_sentences, but only passing the first word sub_list[0] to GetNodule.
This type of error is fairly common, and somewhat hard to catch, because Python code which expects a list of strings will happily accept and iterate over the individual characters in a single string instead if you call it incorrectly. If you want to be defensive, maybe it would be a good idea to add something like
assert not all(len(x)==1 for x in sentence)
And of course, as @dyz notes in their answer, if you expect sentence to already be a list of words, there is no need to split anything inside the function. Just loop over the sentence.
return [w for w in sentence if w in keyword_list]
As an aside, you probably want to extend the final result with the list result_calcified_nod rather than append it.
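For example (illustrative only, using the fixed GetNodule from above):

count_nodule = []
for sub_list in tokens_sentences:
    count_nodule.extend(GetNodule(sub_list, nodule_keywords))

# append would give nested per-sentence lists such as [['nodule'], [], ...],
# while extend gives one flat list such as ['nodule', ...].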
I use spaCy to locate verbs in sentences via POS tags and then try to manipulate each verb. The manipulation depends on a condition, for example on the word that precedes the verb. For instance, I might want to convert this sentence, containing three verbs (does, hurt, run):
(1) "Why does it hurt to run very fast."
into this sentence:
(2) "It hurts to run very fast."
This looks straightforward to me. However, somehow my function has a problem when it encounters the same POS tag twice in the same sentence. It looks like in that case the condition in one of the if clauses (line 13 below) is not updated, so that it evaluates as False when it should be True. I cannot figure out what I am overlooking and how to solve it. Here is my code:
import pandas as pd
import spacy
nlp = spacy.load('en')
s = "Why does it hurt to run very fast."
df = pd.DataFrame({'sentence':[s]})
k = df['sentence']
 1 def marking(row):
 2     L = row
 3     verblst = [('VB'), ('VBZ'), ('VBP')]  # list of verb POS tags to focus on
 4     chunks = []
 5     pos = []
 6     for token in nlp(L):
 7         pos.append(token.tag_)  # Just to check if POS tags are handled well
 8     print(pos)
 9     if "Why" in L:
10         for token in nlp(L):
11             if token.tag_ in verblst:
                    # This line checks the POS tag of the word preceding the verb:
12                 print(pos[pos.index(token.tag_)-1])
13                 if pos[pos.index(token.tag_)-1] == 'TO':  # Here things go wrong
14                     chunks.append(token.text + token.whitespace_)
15                 elif pos[pos.index(token.tag_)-1] == 'WRB':
16                     chunks.append(token.text + token.whitespace_)
17                 else:
18                     chunks.append(token.text + 's' + token.whitespace_)
19             else:
20                 chunks.append(token.text_with_ws)
    L = chunks
    L.pop(0)
    L.pop(0)
    L = [L[0].capitalize()] + L[1:]
    L = "".join(L)
    return L
x = k.apply(marking)
print(x)
This gives the following result:
"It hurts to runs very fast." # The 's' after run should not be there
sentence s: "Why does it hurt to run very fast."
POS list of s (indices 0-8): ['WRB', 'VBZ', 'PRP', 'VB', 'TO', 'VB', 'RB', 'RB', '.']
The problem is caused by the fact that 'VB' is found at both index 3 and 5. It looks like the index in line 13 is not updated after the first 'VB' - which I expected to happen automatically. As a result, with the second 'VB', line 13 looks at index 2 instead of index 4. Hence, the condition in 13 is not met, and the second VB is processed in line 18 - resulting in a mistake. I am puzzled by why this happens. What am I not seeing? And how can this be solved?
Thanks so much for any help.
It seems like the problem here is that you're only looking up the index of the token.tag_ string value in your list of part-of-speech tag strings that you've compiled upfront. This always returns the first match – so in the case of "run", your script doesn't actually check the POS before index 5 (which would be TO), but instead, the POS before index 3 (which is PRP).
Consider the following abstract example:
test = ['a', 'b', 'c', 'a', 'd']
for value in test:
    print(test.index(value))  # this will print 0, 1, 2, 0, 4
A better (and potentially also much simpler) solution would be to just iterate over the Token objects and use the Token.i attribute, which returns its index in the parent document. Ideally, you want to process the text once, store the doc and then index into it later when you need it. For example:
chunks = []
doc = nlp("Why does it hurt to run very fast.")

if doc[0].text == 'Why':  # the first token's text is "Why"
    for token in doc:
        if token.tag_ in ['VB', 'VBZ', 'VBP']:
            token_index = token.i              # this is the token index in the document
            prev_token = doc[token_index - 1]  # the previous token in the document
            if prev_token.tag_ == 'TO':
                chunks.append(token.text_with_ws)  # token text + whitespace
            # and so on
Ideally, you always want to convert spaCy's output to plain text as late as possible. Most of the problems you were trying to solve in your code are things that spaCy already does for you – for example, it gives you the Doc object and its views Span and Token that are performant, let you index into them, iterate over tokens anywhere and, more importantly, never destroy any information available in the original text. Once your output is a single string of text plus whitespace plus other characters you've added, you won't be able to recover the original tokens very easily. You also won't know which token had whitespace attached and how the individual tokens are/were related to each other.
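Putting this together, a complete version of the rewrite might look like the sketch below. It hard-codes the "Why does ..." pattern from your question and assumes the model assigns the same POS tags shown in your list above:

import spacy

nlp = spacy.load('en')
doc = nlp("Why does it hurt to run very fast.")

chunks = []
if doc[0].text == 'Why':
    for token in doc[2:]:                      # skip "Why" and "does" up front
        if token.tag_ in ('VB', 'VBZ', 'VBP'):
            prev_token = doc[token.i - 1]      # the real previous token, via token.i
            if prev_token.tag_ in ('TO', 'WRB'):
                chunks.append(token.text_with_ws)
            else:
                chunks.append(token.text + 's' + token.whitespace_)
        else:
            chunks.append(token.text_with_ws)

if chunks:
    result = "".join(chunks)
    print(result[0].upper() + result[1:])      # It hurts to run very fast.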
For more details on the Doc, Token and Span objects, see this section in the docs and the API reference, which lists the available attributes for each object.
I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization. This applies to both the encoder and the decoder.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False


def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
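As an illustration of that scheme (this is not revtok's actual API, just a hypothetical (text, pre_space, post_space) encoding), the join could look like this:

def join_with_affinity(annotated_tokens):
    # emit a space between two tokens only when the left token wants a
    # following space AND the right token wants a preceding space
    out = []
    for i, (text, pre, post) in enumerate(annotated_tokens):
        if i > 0 and annotated_tokens[i - 1][2] and pre:
            out.append(" ")
        out.append(text)
    return "".join(out)

print(join_with_affinity([("hello", 1, 1), ("world", 1, 1)]))  # hello world
print(join_with_affinity([("hello", 1, 0), (".", 1, 1)]))      # hello.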
TL;DR
I've written some code that attempts to do it; the snippet is below.
Another approach, with a computational complexity of O(n^2), would be to use the function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string


class detokenizer:
    """ This class is an attempt to detokenize spaCy tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of spaCy tokenized words is equal to the length of joined and then spaCy tokenized words...
        In other words, we say we should join only if the join is reversible.
        eg.:
            for the text ["The", "man", "."]
            we would join "man" with "."
            but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
With this approach you may easily merge "do" and "nt", as well as strip the space between a dot "." and the preceding word.
This method is not perfect, as there are multiple possible sentences that lead to the same spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After searching on Google for hours, I found only a few answers, with this very Stack question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)
I wrote a function that finds homographs in a text.
A homograph is a word that shares the same written form as another
word but has a different meaning.
For this I've used the POS tagger from NLTK (pos_tag).
POS-tagger processes a sequence of words, and attaches a part of
speech tag to each word.
For example:
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')].
Code(Edited):
def find_homographs(text):
    homographs_dict = {}
    if isinstance(text, str):
        text = word_tokenize(text)

    tagged_tokens = pos_tag(text)
    for tag1 in tagged_tokens:
        for tag2 in tagged_tokens:
            try:
                if homographs_dict[tag2] == tag1:
                    continue
            except KeyError:
                if tag1[0] == tag2[0] and tag1[1] != tag2[1]:
                    homographs_dict[tag1] = tag2
    return homographs_dict
It works, but it takes too much time because I've used two nested for loops. Please advise me how I can simplify it and make it much faster.
It may seem counterintuitive, but you can easily collect all POS tags for each word in your text, then keep just the words that have multiple tags.
from collections import defaultdict
alltags = defaultdict(set)
for word, tag in tagged_tokens:
    alltags[word].add(tag)
homographs = dict((w, tags) for w, tags in alltags.items() if len(tags) > 1)
Note the two-variable loop; it's a lot handier than writing tag1[0] and tag1[1]. You'll have to look up defaultdict (and set) in the manual.
Your output format cannot handle words with three or more POS tags, so the dictionary homographs has words as keys and sets of POS tags as values.
And two more things I would advise: (1) convert all words to lower case to catch more "homographs"; and (2) nltk.pos_tag() expects to be called on one sentence at a time, so you'll get more correct tags if you sent_tokenize() your text and word_tokenize() and pos_tag() each sentence separately.
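A short sketch combining both suggestions (lower-casing, plus tagging sentence by sentence) with the per-word tag sets from above:

from collections import defaultdict
from nltk import sent_tokenize, word_tokenize, pos_tag

def find_homographs(text):
    alltags = defaultdict(set)
    for sentence in sent_tokenize(text):
        for word, tag in pos_tag(word_tokenize(sentence)):
            alltags[word.lower()].add(tag)
    # keep only words observed with more than one POS tag
    return {w: tags for w, tags in alltags.items() if len(tags) > 1}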
Here is a suggestion (not tested); the main idea is to build a dictionary while parsing tagged_tokens, to identify homographs in a non-nested loop:
temp_dict = dict()
for tag in tagged_tokens:
    temp_dict.setdefault(tag[0], list()).append(tag[1])

for temp in list(temp_dict.items()):
    if len(temp[1]) == 1:
        del temp_dict[temp[0]]

print(temp_dict)