Natural Language Processing Model - nlp

I'm a beginner in NLP and I'm working on a project to parse and understand the intent of English input lines typed by a user.
Here is what I think I should do:
Create a set of sentences, POS-tagged and hand-labelled with an intent for every sentence.
Create a model, say a decision tree, and train it on the above sentences.
Try the model on user input:
Do basic tokenizing and POS tagging on the user's input sentence and run it through the model above to determine the intent of that sentence.
It may all be completely wrong or silly, but I'm determined to learn how to do it. I don't want to use ready-made solutions, and the programming language is not a concern.
How would you approach this task? Which model would you choose, and why? What steps are normally taken to build an NLP parser?
Thanks

I would use NLTK.
There is an online book with a chapter on tagging and a chapter on parsing. They also provide models in Python.
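To make the plan from the question concrete, here is a minimal sketch of that workflow (hand-labelled intents, word and POS-tag features, a decision tree) with NLTK. The tiny training set, the intent labels, and the feature function are all invented for illustration and are not part of the question:
import nltk
# one-time downloads, if not already present:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

# hand-labelled (sentence, intent) pairs -- invented for illustration
train_sentences = [
    ("turn on the lights", "device_control"),
    ("switch off the fan", "device_control"),
    ("what is the weather today", "weather_query"),
    ("will it rain tomorrow", "weather_query"),
]

def intent_features(sentence):
    # word-presence and POS-tag-presence features
    tokens = nltk.word_tokenize(sentence.lower())
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    features = {"has({})".format(t): True for t in tokens}
    features.update({"tag({})".format(t): True for t in tags})
    return features

train_set = [(intent_features(s), intent) for s, intent in train_sentences]
classifier = nltk.DecisionTreeClassifier.train(train_set)

# classify new user input; with a toy training set this small the tree is
# unreliable -- the point is only to show the workflow
print(classifier.classify(intent_features("turn the lights off")))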

Here is a simple example based on NLTK and a Naive Bayes classifier:
import nltk
import random
from nltk.corpus import movie_reviews

# (document, category) pairs from the movie_reviews corpus
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# frequency distribution over all (lower-cased) words in the corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# use the 3000 most frequent words as features
word_features = [w for w, _ in all_words.most_common(3000)]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

print(find_features(movie_reviews.words("neg/cv000_29416.txt")))

featuresets = [(find_features(rev), category) for (rev, category) in documents]
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo Accuracy:", nltk.classify.accuracy(classifier, testing_set) * 100)

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment sentence that goes along an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")

def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if child.dep_ in ["nsubj"] and not child.is_stop:  # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if child.dep_ in ["acomp", "advcl"] and not child.is_stop:  # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if child.dep_ == "aux" and child.tag_ == "MD":  # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if child.dep_ == "neg":  # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if add_neg_pfx and M != "999999":
            M = neg_prefix + " " + M
        if A != "999999" and M != "999999":
            rule3_pairs.append((A, M))
    return rule3_pairs

print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this captures too little of the content of the text.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above, and also another library, stanza, which only gave similar results.
Unfortunately, if your task is to extract all expressive words from the text (all the words that carry sentiment), it is not possible with the current state of the art. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is not as obvious, to say nothing of edge cases where a seemingly positive word like "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, but models can normally only output two opposite labels for such words.
This is why, for sentiment analysis, the only reliable approach is to build a machine learning model that classifies texts as a whole, which means you should adapt your software to accept that final verdict and process it in one way or another.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier algorithm (which, among other things, can classify more than just sentiment) that is implemented in NLTK:
from nltk import NaiveBayesClassifier, classify

# `dataset` is assumed to be a list of (feature-dict, label) pairs,
# e.g. ({'terrible': True, ...}, 'neg')
train_data = dataset[:7000]
test_data = dataset[7000:]

# train() returns the trained model
classifier = NaiveBayesClassifier.train(train_data)

# To get the accuracy, use the classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a list of words. It's preferable to remove any words that carry no sentimental weight, such as stop words and punctuation, so they don't confuse the model:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clearLexemes(words):
    # drop stop words and punctuation-only tokens
    stop = set(stopwords.words("english"))
    return [word for word in words
            if word not in stop and word not in "!?<>:;.&*%^"]

text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify(dict((token, True) for token in tokens))))
The output will be the sentiment of the text.
Important notes: the Naive Bayes classifier
requires minimal parameters and trains relatively fast;
is highly efficient for natural-language tasks (it is also used for gender identification and named entity recognition);
is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" is negative and uses irony to mask the negative attitude behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression classifier from your favourite machine learning framework. In this notebook I used the Amazon food review dataset
to measure how quickly model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be the sentiment).
import numpy as np   # used below to coerce possibly-missing summaries to strings
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report

# `reviews` is assumed to be a DataFrame loaded from the Amazon food review dataset,
# with a "Summary" text column and a "Score" label column.

# Preparing the data
ys: pd.DataFrame = reviews.head(170536)   # ~30% of the dataframe is test data
xs: pd.DataFrame = reviews[170536:]       # ~70% of the dataframe is training data

# Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])

# Measuring accuracy
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(predictions, ys["Score"],
                               target_names=labels, output_dict=True)
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
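As a minimal usage sketch, the fitted vectorizer and model from above can score a new, made-up review summary like this:
# score a new, made-up review summary with the fitted CountVectorizer and model
new_summary = ["arrived quickly and tastes great"]
print(lr.predict(cv.transform(new_summary)))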
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound by its limitations. It is a powerful topic that deserves coverage in the classical NLP suite. Unfortunately, understanding meaning closely enough to extract situational sentiment is currently a feat close to inventing Artificial General Intelligence, although the technology is rapidly moving in that direction.

How to generate sentence embeddings using the Longformer model

I am using the Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model:
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2
I want to generate sentence-level embeddings. I have a data frame which has a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I modify the above code to generate embeddings for these sentences?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.
The Longformer uses a local attention mechanism and you need to pass a global attention mask to let one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel

ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)

text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this."  # pass your text column here from the data frame

encoding = tokenizer(text, return_tensors="pt")

# global attention on the first token, local attention everywhere else
global_attention_mask = torch.zeros_like(encoding["input_ids"])
global_attention_mask[:, 0] = 1
encoding["global_attention_mask"] = global_attention_mask

o = model(**encoding)
# use the hidden state of the first (globally attending) token as the sentence embedding
sentence_embedding = o.last_hidden_state[:, 0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings and faces the same issues as MLM-pre-trained BERT models regarding sentence embeddings.
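If the first-token embedding turns out not to be informative enough for your data, a common alternative is mean pooling over the token embeddings. A minimal sketch, reusing `o` and `encoding` from the snippet above:
# mean pooling: average the token embeddings, ignoring padding positions
mask = encoding["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
summed = (o.last_hidden_state * mask).sum(dim=1)         # (1, hidden_size)
mean_pooled_embedding = summed / mask.sum(dim=1).clamp(min=1)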

singularize noun phrases with spacy

I am looking for a way to singularize noun chunks with spacy
import spacy

S = 'There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['multiple sentences', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
GOAL: Singularized: ['sentence', 'part', 'Processing']
Is there any obvious way to do this? Does it always depend on the POS of every root word?
Thanks
note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach looks to me like it leads to many different methods, and of course different ones for every language. (I am working in EN, FR, DE)
To get the basic form of each word, you can use the .lemma_ property of the chunk or the token.
I use Spacy version 2.x
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print (" ".join([token.lemma_ for token in doc]))
and the output :
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
Out[82]: ['sentence', 'part', 'processing']

An NLP Model that Suggests a List of Words for an Incomplete Sentence

I have read a number of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests words for an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
From the papers I have read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words for an incomplete sentence. Is it feasible? Please enlighten me, because I'm really confused. What is the state-of-the-art model I can use for suggesting a list of (semantically coherent) words for an incomplete sentence?
Is it necessary that the list of suggested output words is included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be part of the overall vocabulary with which this BERT model has been trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()  # turning off the dropout

def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results

print(fill_the_gaps(text='I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text='Im sad because you are [MASK] .'))
print(fill_the_gaps(text='Im worried because you are [MASK] .'))
print(fill_the_gaps(text='Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising: transformer networks are generally good at copying words, and from a semantic point of view these symmetric continuations indeed look very likely.
Moreover, if it is not a random word that is missing but exactly the last word (or last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence, as in the sketch below.
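For that left-to-right completion case, here is a minimal sketch with GPT-2 via the current Hugging Face transformers library (rather than pytorch-pretrained-bert); the prompt text is just an example:
# Minimal sketch: completing the end of a sentence with GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

prompt = 'I bought an umbrella because'
input_ids = tokenizer.encode(prompt, return_tensors='pt')
with torch.no_grad():
    # greedy decoding of a few continuation tokens
    output = model.generate(input_ids,
                            max_length=input_ids.shape[1] + 10,
                            do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))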

How to find the score for sentence Similarity using Word2Vec

I am new to NLP. How do I find the similarity between 2 sentences, and how do I print scores for each word? Also, how do I implement the gensim Word2Vec model?
Try this code:
here are my two sentences:
sentence1 = "I am going to India"
sentence2 = " I am going to Bharat"
from gensim.models import word2vec
import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')

# The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

# Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning) / (np.linalg.norm(sentence1_meaning) * np.linalg.norm(sentence2_meaning))
You can train the model and use the similarity function to get the cosine similarity between two words.
Here's a simple demo:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# note: in gensim >= 4.0 the `size` parameter is called `vector_size`
model = Word2Vec(common_texts,
                 size=500,
                 window=5,
                 min_count=1,
                 workers=4)
word_vectors = model.wv
word_vectors.similarity('computer', 'computer')
The output will be 1.0, of course, which indicates 100% similarity.
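For short texts rather than single words, gensim's KeyedVectors also provide n_similarity, which compares the averaged vectors of two word lists. Every word must be in the model's vocabulary, so the example below sticks to words from the toy common_texts corpus:
# cosine similarity between the averaged vectors of two word lists;
# all words must be present in the trained model's vocabulary
print(word_vectors.n_similarity(['human', 'interface'], ['computer', 'system']))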
After your from gensim.models import word2vec, word2vec is a Python module – not a function that you can call as word2vec(words1[0]) or word2vec(w).
So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.
As @david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:
https://radimrehurek.com/gensim/models/word2vec.html
The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
...however, it's really best to run as a local notebook, so you can step through the execution cell-by-cell, and try different variants yourself, perhaps even adapting it to use your data instead.
When you reach that level, note that:
these models require far more than just a few sentences of training data - so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts, or (b) a model trained on a compatible corpus, which you then apply to your out-of-corpus sentences;
using the average of all the word vectors in a sentence is just one relatively simple way to make a vector for a longer text; there are many other, more sophisticated ways. One alternative very similar to Word2Vec is the 'Paragraph Vector' algorithm, also available in gensim as the class Doc2Vec; see the sketch after these notes.
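Here is a minimal Doc2Vec sketch on gensim's toy common_texts corpus; with so little data the vectors are not meaningful, it only shows the shape of the API:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

# each training document gets a tag; here the index is used as the tag
tagged = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# infer vectors for new token lists and compare them with cosine similarity
vec1 = model.infer_vector("human computer interface".split())
vec2 = model.infer_vector("user interface system".split())
print(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))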
