Cluster sentences while considering synonyms - scikit-learn

I have the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import pandas as pd
sentences = ["I have the ability", "I have the weakness", "I have the capability", "I have the power"]
tfidf = TfidfVectorizer(max_features=300)
tfidf.fit(sentences)
X = tfidf.transform(sentences)
k = 2
model = KMeans(n_clusters=k, random_state=1)
model.fit(X)
print(pd.DataFrame(columns=["sentence"], data=sentences).join(pd.DataFrame(columns=["cluster"], data=model.labels_)))
The output looks like this:
                sentence  cluster
0     I have the ability        0
1    I have the weakness        0
2  I have the capability        0
3       I have the power        1
As you can see, "I have the ability", "I have the weakness", and "I have the capability" were grouped in the same cluster (cluster 0), while "I have the power" was placed in a separate cluster. I think the grouping is essentially arbitrary, because the model can't tell which sentences actually mean the same thing. I want a way to group "I have the ability", "I have the capability", and "I have the power" together by specifying that ability, capability and power are synonyms, so basically mapping all words to their synonyms. Is there an existing package for this?
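One workaround, short of a dedicated package, is to normalize synonyms to a single canonical token before vectorizing, so that TF-IDF treats them as the same word. A minimal sketch, assuming a hand-built synonym_map dictionary (the mapping below is made up for this example):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical hand-built mapping: each synonym is replaced by one canonical token.
synonym_map = {"capability": "ability", "power": "ability"}

def normalize(sentence):
    # Replace every word with its canonical form, if it has one.
    return " ".join(synonym_map.get(word, word) for word in sentence.lower().split())

sentences = ["I have the ability", "I have the weakness",
             "I have the capability", "I have the power"]
normalized = [normalize(s) for s in sentences]

tfidf = TfidfVectorizer(max_features=300)
X = tfidf.fit_transform(normalized)
model = KMeans(n_clusters=2, random_state=1)
model.fit(X)
print(list(zip(sentences, model.labels_)))
A more general alternative is to replace TF-IDF with word or sentence embeddings, so semantically similar words end up close together without a hand-built map (see the Word2Vec question further down).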

Related

NLP: Get opinionated terms that correspond to aspect terms

I want to extract the sentiment sentence that goes along an aspect term in a sentence. I have the following code:
import spacy
nlp = spacy.load("en_core_web_lg")
def find_sentiment(doc):
    # find roots of all entities in the text
    ner_heads = {ent.root.idx: ent for ent in doc.ents}
    rule3_pairs = []
    for token in doc:
        children = token.children
        A = "999999"
        M = "999999"
        add_neg_pfx = False
        for child in children:
            if child.dep_ in ["nsubj"] and not child.is_stop:  # nsubj is nominal subject
                if child.idx in ner_heads:
                    A = ner_heads[child.idx].text
                else:
                    A = child.text
            if child.dep_ in ["acomp", "advcl"] and not child.is_stop:  # acomp is adjectival complement
                M = child.text
            # example - 'this could have been better' -> (this, not better)
            if child.dep_ == "aux" and child.tag_ == "MD":  # MD is modal auxiliary
                neg_prefix = "not"
                add_neg_pfx = True
            if child.dep_ == "neg":  # neg is negation
                neg_prefix = child.text
                add_neg_pfx = True
            # print(child, child.dep_)
        if add_neg_pfx and M != "999999":
            M = neg_prefix + " " + M
        if A != "999999" and M != "999999":
            rule3_pairs.append((A, M))
    return rule3_pairs
print(find_sentiment(nlp('NEW DELHI Refined soya oil remained weak for the second day and prices shed 0.56 per cent to Rs 682.50 per 10 kg in futures market today as speculators reduced positions following sluggish demand in the spot market against adequate stocks position.')))
This gets me the output: [('oil', 'weak'), ('prices', 'reduced')]
But that captures too little of the content of the text.
I want to know if it is possible to get an output like: [('oil', 'weak'), ('prices', 'shed 0.56 percent'), ('demand', 'sluggish')]
Is there any approach you recommend trying?
I tried the code given above. I also tried another library, stanza, which only gave similar results.
Unfortunately, if your task is to extract all expressive words from the text (all the words that carry sentimental significance), then it is not possible with the current state of affairs. Language is highly variable, and the same word can change its sentiment and meaning from sentence to sentence. While words like "awful" are easy to classify as negative, "demand" from your text is not as obvious, not to mention edge cases where a seemingly positive word like "incredible" reverses its sentiment when used as an intensifier: "incredibly stupid" should be classified as very negative, but machines can normally only output two opposite labels for such words.
This is why, for the purposes of sentiment analysis, the only reliable way is to build a machine learning model that classifies texts as a whole, which means you should adapt your software to accept a single verdict per text and process it one way or another.
Naive Bayes Classifier
The simplest way to classify text by sentiment is the Naive Bayes classifier (which, incidentally, can classify more than just sentiment), as implemented in NLTK:
from nltk import NaiveBayesClassifier, classify

# `dataset` is assumed to be a list of (feature-dict, label) pairs prepared beforehand.
train_data = dataset[:7000]
test_data = dataset[7000:]

# The train method returns the trained model.
classifier = NaiveBayesClassifier.train(train_data)

# To get the accuracy, use the classify.accuracy method:
print("Accuracy is:", classify.accuracy(classifier, test_data))
In order to make a prediction, we need to pass a feature dictionary built from the words of the text. It's preferable to remove words that carry no sentimental significance, such as stop words and punctuation, so that they don't disturb the model:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clearLexemes(words):
    # Keep only words that are not stop words and contain no punctuation characters.
    return [word for word in words
            if word not in stopwords.words("english")
            and not any(ch in "!?<>:;.&*%^" for ch in word)]

text = "What a terrible day!"
tokens = clearLexemes(word_tokenize(text))
print("Text sentiment is " + str(classifier.classify({token: True for token in tokens})))
The output will be the sentiment of the text.
The important notes:
- it requires few parameters to train and trains relatively fast;
- it is highly effective for working with natural language (it is also used for gender identification and named entity recognition);
- it is unlikely to properly classify edge cases where words shift their sentiment in creatively styled or rare utterances. For example, "Sweetheart, I wish all of your fears would come true and you will be happy to live in such a world!" is negative and uses irony to mask a negative attitude behind positive expressions, and the model may not be able to detect this.
Logistic Regression
Another related method is to use a logistic regression classifier from your favourite machine learning framework. In this notebook I used the Amazon food review dataset to measure how quickly model accuracy grows as you feed it more and more data. The data you feed the model is the raw text and its score label (which in your case could be the sentiment).
import numpy as np  # used below to coerce values to strings via np.str_
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report

# Preparing the data; `reviews` is assumed to be the loaded review DataFrame.
ys: pd.DataFrame = reviews.head(170536)  # ~30% of the dataframe is test data
xs: pd.DataFrame = reviews[170537:]      # ~70% of the dataframe is training data

# Training the model
lr = LogisticRegression(max_iter=1000)
cv = CountVectorizer(token_pattern=r'\b\w+\b')
train = cv.fit_transform(xs["Summary"].apply(lambda x: np.str_(x)))
test = cv.transform(ys["Summary"].apply(lambda x: np.str_(x)))
lr.fit(train, xs["Score"])

# Measuring accuracy:
predictions = lr.predict(test)
labels = ["x1", "x2", "x3", "x4", "x5"]
report = classification_report(predictions, ys["Score"],
                               target_names=labels, output_dict=True)
accuracy = [report[label]["precision"] for label in labels]
print(accuracy)
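As a quick illustration of using the trained model afterwards, here is a sketch that reuses the fitted cv and lr objects from above on new, made-up summaries:
# Transform new, unseen text with the already-fitted vectorizer, then predict its score.
new_texts = ["Absolutely loved this product", "Arrived broken and late"]
new_features = cv.transform(new_texts)
print(lr.predict(new_features))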
Conclusion
Sentiment analysis is a worthwhile area of academic and industrial research that relies entirely on machine learning and is bound by its limitations. It is a powerful topic that belongs in the classical NLP suite. Unfortunately, understanding meaning closely enough to extract situational sentiment is currently a feat close to inventing Artificial General Intelligence; however, the technology is rapidly moving in that direction.

How to count specific terms in tokenized sentences within a pandas df

I'm new to Python and nltk, so I would really appreciate your input on the following problem.
Goal:
I want to search and count the occurrence of specific terminology in tokenized sentences which are stored in a pandas DataFrame. The terms I'm searching for are stored in a list of strings. The output should be saved in a new column.
Since the words I'm searching for are grammatically inflected (e.g. cats instead of cat) I need a solution which not only displays exact matches. I guess stemming the data and searching for specific stems would be a proper approach but let's assume this is not an option here, as we would still have semantic overlaps.
What I tried so far:
In order to further handle the data, I preprocessed it with the following steps:
Put everything in lower case
Remove punctuation
Tokenization
Remove stop words
I tried searching for single terms with str.count('cat') but this doesn't do the trick and the data is marked as missing with NaN. Additionally, I don't know how to iterate over the search word list in an efficient way while using pandas.
My code so far:
import numpy as np
import pandas as pd
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Function to remove punctuation
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

# Target data where strings should be searched and counted
data = {'txt_body': ['Ab likes dogs.', 'Bc likes cats.',
                     'De likes cats and dogs.', 'Fg likes cats, dogs and cows.',
                     'Hi has two grey cats, a brown cat and two dogs.']}
df = pd.DataFrame(data=data)

# Search words stored in a list of strings
search_words = ['dog', 'cat', 'cow']

# Store stopwords from nltk.corpus
stop_words = set(stopwords.words('english'))

# Data preprocessing
df['txt_body'] = df['txt_body'].apply(lambda x: x.lower())
df['txt_body'] = df['txt_body'].apply(remove_punctuation)
df['txt_body'] = df['txt_body'].fillna("").map(word_tokenize)
df['txt_body'] = df['txt_body'].apply(lambda x: [word for word in x if word not in stop_words])

# Here is the problem space
df['search_count'] = df['txt_body'].str.count('cat')
print(df.head())
Expected output:
                                       txt_body  search_count
0                             [ab, likes, dogs]             1
1                             [bc, likes, cats]             1
2                       [de, likes, cats, dogs]             2
3                 [fg, likes, cats, dogs, cows]             3
4  [hi, two, grey, cats, brown, cat, two, dogs]             3
A very simple solution would be this:
def count_occurence(l, s):
    counter = 0
    for item in l:
        if s in item:
            counter += 1
    return counter

df['search_count'] = df.apply(lambda row: count_occurence(row.txt_body, 'cat'), 1)
You could then further decide how to define the count_occurence function. And, to search for the whole search_words, something like this will do the job, although it is probably not the most efficient:
def count_search_words(l, search_words):
    counter = 0
    for s in search_words:
        counter += count_occurence(l, s)
    return counter

df['search_count'] = df.apply(lambda row: count_search_words(row.txt_body, search_words), 1)
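If efficiency is a concern, the same logic can be expressed more compactly with a single apply over the token lists; here is a sketch that mirrors count_search_words, reusing the df and search_words defined above:
# For each row, count every (token, search word) pair where the search word is a substring of the token.
df['search_count'] = df['txt_body'].apply(
    lambda tokens: sum(s in token for token in tokens for s in search_words)
)
print(df)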

word frequency with TfidfVectorizer

I'm trying to calculate the word frequency for a messaging dataframe using TF-IDF. So far I have this
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
new_group['tokenized_sents'] = new_group.apply(lambda row: nltk.word_tokenize(row['message']),axis=1).astype(str).lower()
vectoriser=TfidfVectorizer()
new_group['tokenized_vector'] = list(vectoriser.fit_transform(new_group['tokenized_sents']).toarray())
However, with the code above I get a bunch of zeros instead of the word frequencies. How can I fix this to get the correct frequencies for the messages? This is my dataframe:
  user_id     date        message     tokenized_sents        tokenized_vector
  X35WQ0U8S   2019-02-17  Need help   ['need', 'help']       [0.0, 0.0]
  X36WDMT2J   2019-03-22  Thank you!  ['thank', 'you', '!']  [0.0, 0.0, 0.0]
First of all, for the counts you don't want to use TfidfVectorizer, as its output is normalized. You want to use CountVectorizer. Second, you don't need to tokenize the words, as sklearn has a built-in tokenizer in both TfidfVectorizer and CountVectorizer.
# add whatever settings you want
countVec = CountVectorizer()

# fit transform
cv = countVec.fit_transform(df['message'].str.lower())

# feature names
cv_feature_names = countVec.get_feature_names()

# feature counts
feature_count = cv.toarray().sum(axis=0)

# feature name to count
dict(zip(cv_feature_names, feature_count))
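If you want the counts per message rather than summed over the whole corpus, here is a sketch reusing the cv, cv_feature_names and df names from above (note that in recent scikit-learn versions get_feature_names has been replaced by get_feature_names_out):
import pandas as pd

# One row per message, one column per vocabulary word, values are raw counts.
counts_per_message = pd.DataFrame(cv.toarray(), columns=cv_feature_names, index=df.index)
df['tokenized_vector'] = list(cv.toarray())
print(counts_per_message)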

How to find the score for sentence Similarity using Word2Vec

I am new to NLP. How do I find the similarity between 2 sentences, and how do I print the scores of each word? Also, how do I implement the gensim Word2Vec model?
Try this code. Here are my two sentences:
sentence1 = "I am going to India"
sentence2 = "I am going to Bharat"

from gensim.models import word2vec
import numpy as np

words1 = sentence1.split(' ')
words2 = sentence2.split(' ')

# The meaning of the sentence can be interpreted as the average of its words
sentence1_meaning = word2vec(words1[0])
count = 1
for w in words1[1:]:
    sentence1_meaning = np.add(sentence1_meaning, word2vec(w))
    count += 1
sentence1_meaning /= count

sentence2_meaning = word2vec(words2[0])
count = 1
for w in words2[1:]:
    sentence2_meaning = np.add(sentence2_meaning, word2vec(w))
    count += 1
sentence2_meaning /= count

# Similarity is the cosine between the vectors
similarity = np.dot(sentence1_meaning, sentence2_meaning) / (np.linalg.norm(sentence1_meaning) * np.linalg.norm(sentence2_meaning))
You can train the model and use the similarity function to get the cosine similarity between two words.
Here's a simple demo:
from gensim.models import Word2Vec
from gensim.test.utils import common_texts
model = Word2Vec(common_texts,
                 size=500,       # in gensim >= 4.0 this parameter is called vector_size
                 window=5,
                 min_count=1,
                 workers=4)
word_vectors = model.wv
word_vectors.similarity('computer', 'computer')
The output will be 1.0, of course, which indicates 100% similarity.
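To compare two whole sentences rather than two single words, KeyedVectors also provides n_similarity, which takes the cosine similarity between the mean vectors of two word lists. A small sketch, reusing the toy model trained on common_texts above (so only words from that toy corpus are in the vocabulary):
# Cosine similarity between the averaged vectors of two token lists;
# every token must be in the model's vocabulary.
print(word_vectors.n_similarity(['human', 'computer'], ['user', 'interface']))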
After your from gensim.models import word2vec, word2vec is a Python module, not a function that you can call as word2vec(words1[0]) or word2vec(w).
So your code isn't even close to approaching this correctly, and you should review docs/tutorials which demonstrate the proper use of the gensim Word2Vec class & supporting methods, then mimic those.
As #david-dale mentions, there's a basic intro in the gensim docs for Word2Vec:
https://radimrehurek.com/gensim/models/word2vec.html
The gensim library also bundles within its docs/notebooks directory a number of Jupyter notebooks demonstrating various algorithms & techniques. The notebook word2vec.ipynb shows basic Word2Vec usage; you can also view it via the project's source code repository at...
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/word2vec.ipynb
...however, it's really best to run as a local notebook, so you can step through the execution cell-by-cell, and try different variants yourself, perhaps even adapting it to use your data instead.
When you reach that level, note that:
- these models require far more than just a few sentences as training data, so ideally you'd either have (a) many sentences from the same domain as those you're comparing, so that the model can learn words in those contexts, or (b) a model trained on a compatible corpus, which you then apply to your out-of-corpus sentences;
- using the average of all the word-vectors in a sentence is just one relatively simple way to make a vector for a longer text; there are many other more sophisticated ways. One alternative very similar to Word2Vec is the 'Paragraph Vector' algorithm, also available in gensim as the class Doc2Vec (see the sketch after this list).
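A minimal Doc2Vec sketch, again using gensim's toy common_texts corpus purely to show the API (a real use case needs far more training data; the cosine is computed with numpy just as in the earlier answer):
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.test.utils import common_texts

# Each training document needs a tag; here we simply use its index.
documents = [TaggedDocument(words, [i]) for i, words in enumerate(common_texts)]
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer vectors for two new token lists and compare them with cosine similarity.
v1 = model.infer_vector(['human', 'computer', 'interface'])
v2 = model.infer_vector(['user', 'computer', 'system'])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))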

How to check for unreadable OCRed text with NLTK

I am using NLTK to analyze a corpus that has been OCRed. I'm new to NLTK. Most of the OCR is good -- but sometimes I come across lines that are plainly junk. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5
I want to identify (and filter out) such lines from my analysis.
How do NLP practitioners handle this situation? Something like: if 70% of the words in the sentence are not in WordNet, discard it. Or if NLTK can't identify the part of speech for 80% of the words, discard it? What algorithms work for this? Is there a "gold standard" way to do this?
Using n-grams is probably your best option. You can use google n-grams, or you can use n-grams built into nltk. The idea is to create a language model and see what probability any given sentence gets. You can define a probability threshold, and all sentences with scores below it are removed. Any reasonable language model will give a very low score for the example sentence.
If you think that some words may be only slightly corrupted, you may try spelling correction before testing with the n-grams.
EDIT: here is some sample nltk code for doing this:
import math
from nltk import NgramModel          # note: NgramModel was removed in NLTK 3.x
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist

n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)

def sentenceprob(sentence):
    bigrams = ngrams(sentence.split(), n)
    sentence = sentence.lower()
    tot = 0
    for grams in bigrams:
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"

print(sentenceprob(sentence1))
print(sentenceprob(sentence2))
The results look like:
>>> python lmtest.py
42.7436688972
158.850086668
Lower is better. (Of course, you can play with the parameters).
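Since NgramModel no longer exists in NLTK 3.x, here is a rough equivalent using the nltk.lm module; this is a sketch rather than a drop-in replacement, and note that logscore returns a log-probability, so here higher totals mean more plausible sentences (the opposite sign convention of the old logprob above):
from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

n = 2
train_sents = [[w.lower() for w in sent] for sent in brown.sents(categories='news')]
train_ngrams, vocab = padded_everygram_pipeline(n, train_sents)

lm = Laplace(n)  # add-one smoothing instead of the Lidstone estimator above
lm.fit(train_ngrams, vocab)

def sentence_logscore(sentence):
    # Sum of per-bigram log-probabilities; higher means the model finds the sentence more plausible.
    tokens = [w.lower() for w in sentence.split()]
    return sum(lm.logscore(w[-1], w[:-1]) for w in ngrams(pad_both_ends(tokens, n=n), n))

print(sentence_logscore("This is a standard English sentence"))
print(sentence_logscore("oomfi ow Ba wmnondmam BE wBwHo<oBoBm."))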
