Implementing tf-idf in wordclouds - nlp

I have some Google reviews for some universities in a dataframe like df_unis below. The column uni_name contains the university names. I wish to create word clouds for each university separately, but in such a way that words appear larger in a university's word cloud if they not only appear more frequently in that university's reviews but also less frequently in the reviews of the other universities. I believe this is the idea behind tf-idf: upgrading from simple frequency to "importance." I'm after this importance and the competitive insights the word clouds for different universities give when considered in tandem.
I have written this code, but I suspect that since tf-idf happens inside the loop, it's done for each university's reviews separately and I'm not achieving my goal above. Is that right? I tried bringing tf-idf outside the loop but then couldn't find a way to produce separate word clouds for each university. Any advice is highly appreciated.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud

df_unis = pd.DataFrame({'review': ['this is a good school', 'this school was not worth it', 'aah', 'mediocre school with good profs'],
                        'uni_name': ['a', 'b', 'b', 'a']})

corpus = df_unis.groupby('uni_name')['review'].sum()

for group in corpus.index.unique():
    try:
        vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
        vecs = vectorizer.fit_transform([corpus.loc[group]])
        feature_names = vectorizer.get_feature_names_out()
        dense = vecs.todense()
        lst = dense.tolist()
        tf_idf = pd.DataFrame(lst, columns=feature_names, index=[group])
        cloud = WordCloud().generate_from_frequencies(tf_idf.T.sum(axis=1))
    except Exception:
        continue
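For reference, here is a minimal sketch of the corpus-wide variant described above: fit tf-idf once across all universities (so idf reflects the other universities' reviews) and then build one word cloud per university from its own row of the matrix. It reuses the imports and corpus from the snippet above and is an illustration of the idea, not a verified answer.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3))
vecs = vectorizer.fit_transform(corpus)  # one row per university, idf computed across all of them
tf_idf = pd.DataFrame(vecs.todense(),
                      columns=vectorizer.get_feature_names_out(),
                      index=corpus.index)

for group in tf_idf.index:
    weights = tf_idf.loc[group]
    weights = weights[weights > 0]  # keep only terms that actually occur for this university
    cloud = WordCloud().generate_from_frequencies(weights.to_dict())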

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment phrase that goes along with an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")

def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if child.dep_ in ["nsubj"] and not child.is_stop:  # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if child.dep_ in ["acomp", "advcl"] and not child.is_stop:  # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if child.dep_ == "aux" and child.tag_ == "MD":  # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if child.dep_ == "neg":  # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if add_neg_pfx and M != "999999":
            M = neg_prefix + " " + M
        if A != "999999" and M != "999999":
            rule3_pairs.append((A, M))
    return rule3_pairs

print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
Which gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But this captures too little of the content of the text.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above. I also tried another library, stanza, which gave only similar results.
Unfortunately, if your task is to extract all expressive words from the text (all the words that carry sentimental significance), then it is not possible with the current state of affairs. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is not as obvious, not to mention edge cases where a seemingly positive word like "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, but machines can normally only output two opposite labels for such words.
This is why, for the purposes of sentiment analysis, the only reliable way is to build a machine learning model that classifies texts as a whole, which means you should adapt your software to accept the final verdict and process it one way or another.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier algorithm (which, among other things, can classify more than just sentiment), implemented in NLTK:
from nltk import NaiveBayesClassifier, classify

# The training data is a list of (feature-dict, label) pairs;
# `dataset` is assumed to have been prepared beforehand.
train_data = dataset[:7000]
test_data = dataset[7000:]

# The train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)

# To get accuracy, use the classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a list of words. It's preferable to remove any words that do not carry sentimental significance, such as stop words and punctuation, so that they don't disturb our model:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clearLexemes(words):
    stop = set(stopwords.words("english"))
    return [word for word in words
            if word not in stop and word not in "!?<>:;.&*%^"]

text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify(dict([token, True] for token in tokens))))
The output will be the sentiment of the text.
The important notes:
requires minimal parameters and trains relatively fast;
is highly efficient for working with natural languages (it is also used for gender identification and named entity recognition);
is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" is negative and uses irony to mask a negative attitude behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression classifier from your favourite machine learning framework. In this notebook I used the Amazon food review dataset to measure how fast model accuracy increases as you feed it more and more data. The data you need to feed the model is the raw text and its score label (which in your case could be the sentiment).
import numpy as np  # for converting values to strings
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report

# Preparing the data (`reviews` is assumed to be the loaded Amazon review DataFrame)
ys: pd.DataFrame = reviews.head(170536)   # 30% of the dataframe is test data
xs: pd.DataFrame = reviews[170536:]       # 70% of the dataframe is training data

# Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])

# Measuring accuracy
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(predictions, ys["Score"],
                               target_names=labels, output_dict=True)
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
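As a quick usage sketch (not part of the original notebook), the fitted cv and lr can then score a new summary; the example text is hypothetical:
# Hypothetical new review summary, vectorized with the same CountVectorizer
# and scored by the trained model (Score labels are 1-5 in this dataset).
new_summary = ["Tasty and arrived quickly"]
print(lr.predict(cv.transform(new_summary)))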
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound to its limitations. It is a powerful topic that deserves a place in the classical NLP suite. Unfortunately, understanding meaning well enough to extract situational sentiment is currently a feat close to inventing Artificial General Intelligence; however, technology is rapidly moving in that direction.

Is it possible to get classes on the WordNet dataset?

I am playing with WordNet and trying to solve an NLP task.
I was wondering if there exists any way to get a list of words belonging to some large sets, such as "animals" (i.e. dog, cat, cow etc.), "countries", "electronics" etc.
I believe that it should be possible to somehow get this list by exploiting hypernyms.
Bonus question: do you know any other way to classify words into very large classes, besides "noun", "adjective" and "verb"? For example, classes like "prepositions", "conjunctions", etc.
Yes, you just check if the category is a hypernym of the given word.
from nltk.corpus import wordnet as wn

def has_hypernym(word, category):
    # Assume the category always uses the most popular sense
    cat_syn = wn.synsets(category)[0]
    # For the input, check all senses
    for syn in wn.synsets(word):
        for match in syn.lowest_common_hypernyms(cat_syn):
            if match == cat_syn:
                return True
    return False

has_hypernym('dog', 'animal')     # => True
has_hypernym('bucket', 'animal')  # => False
If the lowest common hypernym is the broader word (the "category" here), that means the category is an ancestor (hypernym) of that sense of the query word, so the query word belongs to the category.
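Since the original goal was a list of words in a category, the same function can also be used as a filter; here is a small usage sketch (the word list is hypothetical):
# Hypothetical word list; filter it down to the members of the 'animal' category.
word_list = ['dog', 'cat', 'bucket', 'cow', 'London']
animals = [w for w in word_list if has_hypernym(w, 'animal')]
print(animals)  # expected: ['dog', 'cat', 'cow']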
Regarding your bonus question, I have no idea what you mean. Maybe you should look at NER or open a new question.
With some help from polm23, I found this solution, which exploits similarity between words and prevents wrong results when the class name is ambiguous.
The idea is that WordNet can be used to compare a list of words with the string animal and compute a similarity score. From the nltk.org webpage:
Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).
from nltk.corpus import wordnet as wn

def keep_similar(words, similarity_thr):
    w2 = wn.synset('animal.n.01')
    similar_words = [word for word in words
                     if wn.synset(word + '.n.01').wup_similarity(w2) > similarity_thr]
    return similar_words
For example, if word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon'], the corresponding scores are:
0.875
0.4444444444444444
0.5
0.7
0.3333333333333333
0.3076923076923077
0.3076923076923077
This can easily be used to generate a list of animals by setting a proper value of similarity_thr.
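For illustration, a small usage sketch with the scores above (the 0.6 threshold is an arbitrary choice):
# Illustrative threshold of 0.6; per the scores listed above,
# only 'dog' (0.875) and 'dinosaur' (0.7) should pass.
word_list = ['dog', 'car', 'train', 'dinosaur', 'London', 'cheese', 'radon']
print(keep_similar(word_list, 0.6))  # expected: ['dog', 'dinosaur']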

nltk latent semantic analysis copies the first topics over and over

This is my first attempt at Natural Language Processing, so I started with Latent Semantic Analysis and used this tutorial to build the algorithm. After testing it I see that it only picks up the terms of the first topic and repeats the same terms over and over for the other topics.
I tried feeding it the documents found in HERE too and it does exactly the same: it repeats the values of the same topic several times in the other ones.
Could anyone help explain what is happening? I've been searching all over and everything seems exactly like in the tutorials.
from sklearn.feature_extraction.text import TfidfVectorizer as tfdv  # tfdv is assumed to alias TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

testDocs = [
    "The Neatest Little Guide to Stock Market Investing",
    "Investing For Dummies, 4th Edition",
    "The Little Book of Common Sense Investing: The Only Way to Guarantee Your Fair Share of Stock Market Returns",
    "The Little Book of Value Investing",
    "Value Investing: From Graham to Buffett and Beyond",
    "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
    "Investing in Real Estate, 5th Edition",
    "Stock Investing For Dummies",
    "Rich Dad's Advisors: The ABC's of Real Estate Investing: The Secrets of Finding Hidden Profits Most Investors Miss",
]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''

#First we apply the standard SKLearn algorithm to compare with.
for element in testDocs:
    #tokens.append(tokenizer.tokenize(element.lower()))
    element = element.lower()
print(testDocs)

#Vectorize the features.
vectorizer = tfdv(max_df=0.5, min_df=2, max_features=8, stop_words='english', use_idf=True)#, ngram_range=(1,3))
#Store the values in matrix X.
X = vectorizer.fit_transform(testDocs)

#Apply LSA.
lsa = TruncatedSVD(n_components=3, n_iter=100)
lsa.fit(X)

#Get a list of the terms in the order it was decomposed.
terms = vectorizer.get_feature_names()
print("Terms decomposed from the document: " + str(terms))
print()

#Print the matrix of concepts. Each number represents how important the term is to the concept,
#and the position relates to the position of the term.
print("Number of components in element 0 of matrix of components:")
print(lsa.components_[0])
print("Shape: " + str(lsa.components_.shape))
print()

for i, comp in enumerate(lsa.components_):
    #Stick each of the terms to the respective components. The zip command pairs the two sequences.
    termsInComp = zip(terms, comp)
    #Sort the terms according to their weight in the component, descending.
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)
    print("Concept %d" % i)
    for term in sortedTerms:
        print(term[0], end="\t")
    print()

find bigram using gensim

This is my code:
from gensim.models import Phrases
documents = ["the mayor of new york was there the hill have eyes","the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]
bigram = Phrases(sentence_stream, min_count=1)
sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(bigram[sent])
I want it to detect "the_hill_have_eyes", but the output is
['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
Phrases is a purely statistical method for combining some unigram-token pairs into new bigram tokens. If it's not combining two unigrams you think should be combined, it's because the training data and/or chosen parameters (like threshold or min_count) don't imply that pairing should be combined.
Note especially that:
even when Phrases-combinations prove beneficial for downstream classification or info-retrieval steps, they may not intuitively/aesthetically match the "phrases" we as human readers would like to see
since Phrases requires bulk statistics for good results, it requires a lot of training data – you are unlikely to see impressive or representative results from tiny toy-sized training data
In particular, with regard to that last point and your example, the way min_count is interpreted in the default Phrases scoring means that even min_count=1 isn't low enough to create bigrams for which there is only a single example in the training corpus.
So, if you expand your training corpus a bit, you may be able to create the results you want. But you should still be aware that this method's only value comes from training on larger, realistic corpora, so anything you see in tiny contrived examples may not generalize to real uses.
What you want is not actually bigrams but "fourgrams".
This can be achieved by doing something like this (an old piece of code I wrote some months ago):
# Imports assumed for this snippet
from gensim.models.phrases import Phrases, Phraser
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath

# read the txt file
sentences = Text8Corpus(datapath('testcorpus.txt'))
phrases = Phrases(sentences, min_count=1, threshold=1)
bigram = Phraser(phrases)
sent = [u'trees', u'graph', u'minors']
# look for words in "sent"
print(bigram[sent])
# output: [u'trees_graph', u'minors']

# to create the bigrams (unigram_sentences is assumed to be a corpus of tokenized sentences)
bigram_model = Phrases(unigram_sentences)
# apply the trained model to a sentence
for unigram_sentence in unigram_sentences:
    bigram_sentence = u' '.join(bigram_model[unigram_sentence])
# get a trigram model out of the bigrams (bigram_sentences is the corpus after the bigram pass)
trigram_model = Phrases(bigram_sentences)
So here you have a trigram model (detecting 3 words together) and you get the idea on how to implement fourgrams.
Hope this helps. Good luck.
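To make that concrete for the question's own data, here is a minimal sketch of stacking two Phrases passes; whether "the_hill_have_eyes" actually emerges still depends on min_count/threshold and on having enough training data, as the first answer explains:
from gensim.models.phrases import Phrases, Phraser

documents = ["the mayor of new york was there the hill have eyes",
             "the_hill have_eyes new york mayor was present"]
sentence_stream = [doc.split(" ") for doc in documents]

# first pass: unigrams -> bigrams
bigram = Phraser(Phrases(sentence_stream, min_count=1, threshold=1))
bigram_stream = [bigram[sent] for sent in sentence_stream]

# second pass: bigrams -> longer units (up to four original words)
fourgram = Phraser(Phrases(bigram_stream, min_count=1, threshold=1))

sent = ['the', 'mayor', 'of', 'new_york', 'was', 'there', 'the_hill', 'have_eyes']
print(fourgram[bigram[sent]])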

An NLP Model that Suggest a List of Words in an Incomplete Sentence

I have read a bunch of papers that talk about predicting missing words in a sentence. What I really want is to create a model that suggests words for an incomplete sentence.
Example:
Incomplete Sentence :
I bought an ___________ because its rainy.
Suggested Words:
umbrella
soup
jacket
In the papers I have read, they utilized the Microsoft Sentence Completion Dataset for predicting missing words in a sentence.
Example :
Incomplete Sentence :
Im sad because you are __________
Missing Word Options:
a) crying
b) happy
c) pretty
d) sad
e) bad
I don't want to predict a missing word from a list of options. I want to suggest a list of words from an incomplete sentence. Is it feasible? Please enlighten me, because I'm really confused. What is a state-of-the-art model I can use for suggesting a list of (semantically coherent) words from an incomplete sentence?
Is it necessary that the list of suggested output words be included in the training dataset?
This is exactly how the BERT model was trained: mask some random words in the sentence, and make your network predict these words. So yes, it is feasible. And no, it is not necessary to have the list of suggested words as a training input. However, these suggested words should be part of the overall vocabulary with which this BERT model has been trained.
I adapted this answer to show how the completion function may work.
# install this package to obtain the pretrained model
# ! pip install -U pytorch-pretrained-bert

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval();  # turning off the dropout

def fill_the_gaps(text):
    text = '[CLS] ' + text + ' [SEP]'
    tokenized_text = tokenizer.tokenize(text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [0] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    with torch.no_grad():
        predictions = model(tokens_tensor, segments_tensors)
    results = []
    for i, t in enumerate(tokenized_text):
        if t == '[MASK]':
            predicted_index = torch.argmax(predictions[0, i]).item()
            predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
            results.append(predicted_token)
    return results

print(fill_the_gaps(text = 'I bought an [MASK] because its rainy .'))
print(fill_the_gaps(text = 'Im sad because you are [MASK] .'))
print(fill_the_gaps(text = 'Im worried because you are [MASK] .'))
print(fill_the_gaps(text = 'Im [MASK] because you are [MASK] .'))
The [MASK] symbol indicates the missing words (there can be any number of them). [CLS] and [SEP] are BERT-specific special tokens. The outputs for these particular prints are
['umbrella']
['here']
['worried']
['here', 'here']
The duplication is not surprising: transformer NNs are generally good at copying words. And from a semantic point of view, these symmetric continuations do look very likely.
Moreover, if it is not a random word which is missing, but exactly the last word (or last several words), you can utilize any language model (e.g. another famous SOTA language model, GPT-2) to complete the sentence.
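For that last-word case, a minimal sketch using the Hugging Face transformers library (my choice here, not the pytorch-pretrained-bert package used above) might look like this:
# Illustrative only: let a GPT-2 language model propose continuations
# for an unfinished sentence.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
completions = generator("Im sad because you are", max_new_tokens=3,
                        do_sample=True, num_return_sequences=3)
for c in completions:
    print(c["generated_text"])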
