I am looking for a way to singularize noun chunks with spacy
S='There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['multiple sentences', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
GOAL: Singularized: ['sentence', 'part', 'Processing']
Is there any obvious way? Does that always depend on the POS of each root word?
Thanks
note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach looks to me like it leads to many different methods, and of course different ones for every language. (I am working in EN, FR, DE.)
To get the base form of each word, you can use the .lemma_ property of the chunk root or of any token.
I am using spaCy version 2.x.
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print (" ".join([token.lemma_ for token in doc]))
and the output:
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
Out[82]: ['sentence', 'part', 'processing']
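Since you mention working in EN, FR, and DE: the same pattern carries over to other languages, as long as the corresponding spaCy model is installed and provides noun chunks (a language-specific syntax iterator). A minimal sketch, assuming the stock fr_core_news_sm and de_core_news_sm models and hypothetical example sentences; lemma quality will vary by language and model:
import spacy

# Assumes the models have been downloaded, e.g.:
#   python -m spacy download fr_core_news_sm
#   python -m spacy download de_core_news_sm
examples = {
    'fr_core_news_sm': 'Il y a plusieurs phrases qui contiennent plusieurs parties.',
    'de_core_news_sm': 'Es gibt mehrere Sätze, die mehrere Teile enthalten.',
}

for model_name, text in examples.items():
    nlp = spacy.load(model_name)
    doc = nlp(text)
    # Singular/base form of each noun chunk root via its lemma
    print(model_name, [chunk.root.lemma_ for chunk in doc.noun_chunks])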
Related
When using spaCy you can easily loop over the noun chunks of a text as follows:
S='This is an example sentence that should include several parts and also make clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentence', 'parts', 'Processing']
How can I get the POS of each of those words (even if it looks like the root of a noun chunk is always a noun), and how can I get the lemma, the shape, and the singular form of that particular word?
Is that even possible?
Thanks.
Each chunk.root is a Token where you can get different attributes, including lemma_ and pos_ (or tag_ if you prefer the Penn Treebank POS tags).
import spacy
S = 'This is an example sentence that should include several parts and also make ' \
    'clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
for chunk in doc.noun_chunks:
    print('%-12s %-6s %s' % (chunk.root.text, chunk.root.pos_, chunk.root.lemma_))
sentence NOUN sentence
parts NOUN part
Processing NOUN processing
BTW, in this sentence "Processing" is a noun, so its lemma is "processing", not "process", which would be the lemma of the verb "processing".
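If you want to see that distinction directly, here is a quick check (hypothetical sentences; the exact tags and lemmas depend on the model):
import spacy

nlp = spacy.load('en_core_web_sm')

# 'processing' tagged as a noun vs. as a verb gives different lemmas
for text in ('Natural language processing is fun.', 'We are processing the data.'):
    doc = nlp(text)
    for token in doc:
        if token.lower_ == 'processing':
            print('%-40s %-6s %s' % (text, token.pos_, token.lemma_))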
I have a few patients' medical record text files which I got from the internet, and I want to identify the files that are of bad quality (misspelled words, special characters between words, erroneous words) and the files of good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus of medical records for identifying misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First tokenize your text (I recommend scispacy for medical text)
Identify possible "bad quality" words simply by counting each unique word across your corpus, e.g. all words that occur <= 3 times
Add words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms. Otherwise use a medical dictionary, e.g. UMLS or https://github.com/glutanimate/wordlist-medicalterms-en, to add the medical words that are not in a regular dictionary
Use pyspellchecker to identify the misspellings, using the Levenshtein distance algorithm to compare against our dictionary
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker
nlp = spacy.load('en_core_sci_md')  # scispaCy model

# corpus is assumed to be an iterable of raw document strings
word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())

# Candidate "bad quality" words: rare and not starting with a digit
infreq_words = [word for word in word_freq.keys() if word_freq[word] <= 3 and not word[0].isdigit()]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]

# Write the frequent (assumed correct) words out as an extra dictionary file
add_to_dictionary = " ".join(freq_words)
with open("medical_dict.txt", "w+") as f:
    f.write(add_to_dictionary)

spell = SpellChecker()
spell.distance = 1  # only consider candidates 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')

misspelled = spell.unknown(infreq_words)

misspell_dict = {}
for word in misspelled:
    correction = spell.correction(word)
    if word != correction:
        misspell_dict[word] = correction
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
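For example, a couple of hypothetical cleanup rules (the patterns here are purely illustrative; tune them to the artifacts you actually see in your files):
import re

def basic_cleanup(text):
    # Collapse runs of repeated punctuation, e.g. '!!!' -> '!'
    text = re.sub(r'([!?.,])\1+', r'\1', text)
    # Drop stray special characters embedded inside words, e.g. 'pati#ent' -> 'patient'
    text = re.sub(r'(?<=\w)[#@^~|](?=\w)', '', text)
    # Normalise whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

print(basic_cleanup('The pati#ent was   dis~charged!!!'))
# -> 'The patient was discharged!'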
You can use BioBERT to do a contextual spelling check.
Link: https://github.com/dmis-lab/biobert
I am using spaCy to process sentences of a doc. Given one sentence, I'd like to get the previous and following sentence.
I can easily iterate over the sentences of the doc as follows:
nlp_content = nlp(content)
sentences = nlp_content.sents
for idx, sent in enumerate(sentences):
But I can't get the sentence #idx-1 or #idx+1 from the sentence #idx.
Is there any function or property that could be useful there?
Thanks!
Nick
There isn't a built-in sentence index. You would need to iterate over the sentences once to create your own list of sentence spans to access them this way.
sentence_spans = tuple(doc.sents) # alternately: list(doc.sents)
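A minimal sketch of how that can look in practice (the example text is just illustrative):
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('First sentence. Second sentence. Third sentence.')

# Materialise the generator once so the sentences can be indexed
sentence_spans = list(doc.sents)

for idx, sent in enumerate(sentence_spans):
    prev_sent = sentence_spans[idx - 1] if idx > 0 else None
    next_sent = sentence_spans[idx + 1] if idx < len(sentence_spans) - 1 else None
    print(sent.text, '| prev:', prev_sent, '| next:', next_sent)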
I have read a bunch of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests a word for an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
In the papers I have read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words for an incomplete sentence. Is it feasible? Please enlighten me because I'm really confused. What is the state-of-the-art model I can use for suggesting a list of (semantically coherent) words for an incomplete sentence?
Is it necessary that the list of suggested words be included as output in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be part of the overall vocabulary with which this BERT model has been trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval(); # turning off the dropout
def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results
print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising - transformer NNs are generally good at copying words. And from a semantic point of view, these symmetric continuations indeed look very likely.
Moreover, if it is not a random word which is missing, but exactly the last word (or last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
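For instance, a rough sketch of "suggest the next word" with GPT-2 via the Hugging Face transformers library (a different package than the one used above; the return format of the model call differs between library versions, so treat this as an assumption to adapt rather than a drop-in implementation):
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()  # turn off dropout

def suggest_next_words(prompt, top_k=5):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab_size)
    # Distribution over the token that would follow the prompt
    next_token_logits = logits[0, -1]
    top_ids = torch.topk(next_token_logits, top_k).indices.tolist()
    return [tokenizer.decode([i]).strip() for i in top_ids]

print(suggest_next_words('Im sad because you are'))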
I'm a beginner in NLP and working on a project to parse and understand the intentions of input lines typed by a user in English.
Here is what I think I should do:
Create a set of sentences with POS tags & hand-labelled intentions for every sentence.
Create a model, say a decision tree, and train it on the above sentences.
Try the model on user input:
Do basic tokenizing and POS tagging on the user's input sentence and test it against the above model to determine the intention of the sentence.
It all may be completely wrong or silly but I'm determined to learn how to do it. I don't want to use ready-made solutions and the programming language is not a concern.
How would you guys do this task? Which model would you choose and why? What steps are normally taken to build NLP parsers?
Thanks
I would use NLTK.
There is an online book with a chapter on tagging, and a chapter on parsing. They also provide models in python.
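As a starting point, a minimal tokenize-and-tag sketch with NLTK (assuming the punkt and averaged_perceptron_tagger resources have been downloaded; the sentence is just an example):
import nltk

# One-time downloads, if not already present
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = 'Book me a table for two at seven'
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('Book', 'VB'), ('me', 'PRP'), ('a', 'DT'), ...]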
Here is a simple example based on NLTK and Naive Bayes:
import nltk
import random
from nltk.corpus import movie_reviews
# Requires the movie_reviews corpus: nltk.download('movie_reviews')
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Frequency distribution over all (lower-cased) words in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Use the 3000 most common words as features
word_features = [w for w, _ in all_words.most_common(3000)]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words("neg/cv000_29416.txt")))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]   # first 1900 reviews for training
testing_set = featuresets[1900:]    # remaining reviews for testing

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy: ", nltk.classify.accuracy(classifier, testing_set) * 100)