finding the POS of the root of a noun_chunk with spacy - nlp

When using spacy you can easily loop across the noun_phrases of a text as follows:
S='This is an example sentence that should include several parts and also make clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentence', 'parts', 'Processing']
How can I get the POS of every of those words (even if looks like the root of a noun_phrase is always a noun), and how can I get the lemma, the shape and the word in singular of that particular word.
Is that even possible?
thx.

Each chunk.root is a Token where you can get different attributes including lemma_ and pos_ (or tag_ if you prefer the PennTreekbak POS tags).
import spacy
S='This is an example sentence that should include several parts and also make ' \
'clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
for chunk in doc.noun_chunks:
print('%-12s %-6s %s' % (chunk.root.text, chunk.root.pos_, chunk.root.lemma_))
sentence NOUN sentence
parts NOUN part
Processing NOUN processing
BTW... In this sentence "processing" is a noun so the lemma of it is "processing", not "process" which is the lemma of the verb "processing".

Related

singularize noun phrases with spacy

I am looking for a way to singularize noun chunks with spacy
S='There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
GOAL: Singulirized: ['sentence', 'part', 'Processing']
Is there any obvious way? Is that always depending on the POS of every root word?
Thanks
note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach looks to me that leads to many many different methods and of course different for every language. ( I am working in EN, FR, DE)
To get the basic form of each word, you can use ".lemma_" property of chunk or token property
I use Spacy version 2.x
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print (" ".join([token.lemma_ for token in doc]))
and the output :
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
Out[82]: ['sentence', 'part', 'processing']

POS tagging a single word in spaCy

spaCy POS tagger is usally used on entire sentences. Is there a way to efficiently apply a unigram POS tagging to a single word (or a list of single words)?
Something like this:
words = ["apple", "eat", good"]
tags = get_tags(words)
print(tags)
> ["NNP", "VB", "JJ"]
Thanks.
English unigrams are often hard to tag well, so think about why you want to do this and what you expect the output to be. (Why is the POS of apple in your example NNP? What's the POS of can?)
spacy isn't really intended for this kind of task, but if you want to use spacy, one efficient way to do it is:
import spacy
nlp = spacy.load('en')
# disable everything except the tagger
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
nlp.disable_pipes(*other_pipes)
# use nlp.pipe() instead of nlp() to process multiple texts more efficiently
for doc in nlp.pipe(words):
if len(doc) > 0:
print(doc[0].text, doc[0].tag_)
See the documentation for nlp.pipe(): https://spacy.io/api/language#pipe
You can do something like this:
import spacy
nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]
for word in word_list:
doc = nlp(word)
print(doc[0].text, doc[0].pos_)
alternatively, you can do
import spacy
nlp = spacy.load("en_core_web_sm")
doc = spacy.tokens.doc.Doc(nlp.vocab, words=word_list)
for name, proc in nlp.pipeline:
doc = proc(doc)
pos_tags = [x.pos_ for x in doc]

An NLP Model that Suggest a List of Words in an Incomplete Sentence

I have somewhat read a bunch of papers which talks about predicting missing words in a sentence. What I really want is to create a model that suggest a word from an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
From the journal I have read, they have utilized Microsoft Sentence Completion Dataset for predicting missing words from a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words from an incomplete sentence. Is it feasible? Please enlighten me cause Im really confused. What is state of the art model I can use for suggesting a list of words (semantically coherent) from an incomplete sentence?
Is it necessary that the list of suggested words as an output is included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And not, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be the part of the overall vocabulary with which this BERT has been trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval(); # turning off the dropout
def fill_the_gaps(text):
text = '[CLS] ' + text + ' [SEP]'
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [0] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
with torch.no_grad():
predictions = model(tokens_tensor, segments_tensors)
results = []
for i, t in enumerate(tokenized_text):
if t == '[MASK]':
predicted_index = torch.argmax(predictions[0, i]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
results.append(predicted_token)
return results
print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising - transformer NNs are generally good at copying words. And from semantic point of view, these symmetric continuations look indeed very likely.
Moreover, if it is not a random word which is missing, but exactly the last word (or last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.

What does one do with the POS labelled as 'Conjunction' while WordNet lemmatization?

Simplified tags after the POS tagging by NLTK have been calculated.
simplified = [(word, simplify_wsj_tag(tag)) for word, tag in posTagged]
print(simplifiedTags)
#[('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]
Now the lemma for each word has to be found. Each of these, except conjuction, can be mapped to a wordnet POS class - noun, adjective, adverb, verb. What is supposed to be done with the words labelled as Conjuction? Which is the closest relative of conjuction amongst all the four classes? Or are they supposed to be dropped from the sentence all together?
I think we can use the default one for pos tagger which is noun for parts of speech other than VERB,ADVERB,ADJECTIVE,NOUN.
https://bommaritollc.com/2014/06/30/advanced-approximate-sentence-matching-python/
The above website's Approach #6 does the same thing.
Conjunctions are already in their lemma form, so you could skip them

Natural Language Processing Model

I'm a beginner in NLP and making a project to parse, and understand the intentions of input lines by a user in english.
Here is what I think I should do:
Create a text of sentences with POS tagging & marked intentions for every sentence by hand.
Create a model say: decision tree and train it on the above sentences.
Try the model on user input:
Do basic tokenizing and POS tagging on user input sentence and testing it on the above model for knowing the intention of this sentence.
It all may be completely wrong or silly but I'm determined to learn how to do it. I don't want to use ready-made solutions and the programming language is not a concern.
How would you guys do this task? Which model to choose and why? Normally to make NLP parsers, what steps are done.
Thanks
I would use NLTK.
There is an online book with a chapter on tagging, and a chapter on parsing. They also provide models in python.
Here is a simple example based on NLTK and Bayes
import nltk
import random
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)),category)
for category in movie_reviews.categories()
for fileid in movie_reviews.fileids(category)
]
random.shuffle(documents)
all_words = [w.lower() for w in movie_reviews.words()]
for w in movie_reviews.words():
all_words.append(w.lower())git b
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]
def find_features(document):
words = set(document)
features = {}
for w in word_features:
features[w] = (w in words)
return features
print((find_features(movie_reviews.words("neg/cv000_29416.txt"))))
featuresets = [(find_features(rev),category) for (rev,category) in documents ]
training_set =featuresets[:10]
testing_set = featuresets[1900:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy: ",(nltk.classify.accuracy(classifier,testing_set))* 100 )

Resources