getting same tfidf values for different words - python-3.x

So I am doing a project on Twitter sentiment analysis, in which I need to apply TF-IDF to the collected tweets.
I converted the list of tweets into a single string and fed that to the vectorizer. The problem is that I am getting identical values for most of the words, with a few different values that also belong to very frequent words. Why is this happening? Is it because I am using a single string as input?
Here is the code: https://trinket.io/python/9c2daed912
Here is a screenshot; as you can see, many words have the same TF-IDF value.

Different words can end up with the same value when they occur the same number of times. My code for a Project Gutenberg corpus produces the following.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
tfvect = TfidfVectorizer(stop_words='english')
# The corpus is from Project Gutenberg after all the text cleanup.
karlmarx_freq = tfvect.fit_transform(gutenberg_KarlMarx_Corpus)
tftermFreq = pd.DataFrame(karlmarx_freq.toarray(), columns=tfvect.get_feature_names())
tfsumdf = tftermFreq.sum(axis=0)
pd.DataFrame({'Vocab': tfsumdf.index, 'Frequency': tfsumdf.values}).sort_values(by='Frequency', ascending=False)
The results are:
      Vocab       Frequency
1412  production  0.177513
345   conditions  0.174032
1151  modern      0.142706
1704  social      0.128784
1000  labour      0.128784
641   existence   0.111381
1705  socialism   0.104419
1117  means       0.100939
923   industry    0.100939
For details on how this calculation is made, please refer to the scikit-learn documentation.
TF-IDF = term frequency * inverse document frequency
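As a small illustration of why a single string produces so many identical values (the corpus below is made up): with only one document, the IDF part is the same for every word, so the TF-IDF scores differ only by term count, and words that occur equally often get exactly the same value.
from sklearn.feature_extraction.text import TfidfVectorizer
# One tweet-like string as the entire corpus (hypothetical example).
docs = ["good movie good plot bad acting bad pacing"]
vect = TfidfVectorizer()
scores = vect.fit_transform(docs).toarray()[0]
for word, idx in sorted(vect.vocabulary_.items()):
    print(word, round(scores[idx], 4))
# 'good' and 'bad' (2 occurrences each) share one value;
# the words occurring once all share another.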

Related

SBERT's Interpretability: can we better understand the final results?

I'm familiar with SBERT and its pre-trained models, and they are amazing! But at the same time, I want to understand how the results are calculated, and I can't find anything more specific on their website.
For example, I have a document and I want to find other documents that are similar to it. I used 2 documents of 200-250 words each (I changed model.max_seq_length to 350 so the model can handle longer texts), and in the end the cosine similarity is 0.79. Is that all we can see? Is there a way to extract the main phrases/keywords that made the model return this high similarity value?
Thanks in advance!
Have you tried a simple word-count comparison between the two documents and other random documents? Or TF-IDF, if the two documents are part of a bigger corpus?
Another thing you can do is look inside the stored_embeddings matrix (see code below and here), in which SBERT encodes your sentences (e.g. for 20,000 documents you get a 20,000 x 384 matrix), after having saved it into a pickle file like:
from sentence_transformers import SentenceTransformer
import pickle
# `model` and `sentences` come from your own pipeline; the values below are
# just placeholders so the snippet runs on its own.
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["First example document.", "Second example document."]
embeddings = model.encode(sentences)
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)
with open('embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_embeddings = stored_data['embeddings']
The stored_embeddings variable can be handled as a NumPy matrix and can therefore be indexed to access single elements. You can go through the 384 dimensions column by column (for a big matrix, avoid iterating element by element, it will take forever) and compare the values the two documents take in each dimension, for example to see which dimensions have the highest values or variance.
I'm not saying that what you find will be interpretable, but at least you can try and see.
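As a rough sketch of that kind of inspection (assuming the snippet above has been run, so stored_embeddings holds one 384-dimensional vector per document), you can normalize two document vectors and look at their per-dimension products: the products sum to the cosine similarity, so the largest ones show which dimensions drive the score.
import numpy as np
# Assumes `stored_embeddings` from the pickle above, shape (n_docs, embedding_dim).
emb = np.asarray(stored_embeddings)
a = emb[0] / np.linalg.norm(emb[0])   # first document, L2-normalized
b = emb[1] / np.linalg.norm(emb[1])   # second document, L2-normalized
contributions = a * b                 # per-dimension contributions to the cosine similarity
top_dims = np.argsort(-contributions)[:10]
print("cosine similarity:", float(contributions.sum()))
print("dimensions contributing most:", top_dims)
print("their contributions:", contributions[top_dims])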

finding semantic similarity between 2 statements

I am currently working on a small application in Python that has search functionality (currently using difflib), but I want to build a semantic search that returns the top 5 or 10 results from my database based on the text a user inputs, the way a search engine works. I found some solutions here.
The problem is that one of those solutions focuses on pairs like the two statements below being semantically different, and I don't care about that; it makes things harder than I need. I would also prefer a pretrained neural network model or library that I can use easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station
I also found some solutions using gensim and GloVe embeddings, but they compute similarity between words, not sentences.
What do I want?
Suppose my DB has the statement display classes and the user inputs show, showed, displayed, displayed class, show types, etc.; these should all be treated as the same. And if the two statements above are treated as the same too, I don't care. displayed and displayed class are already handled by difflib.
Points to be noted
Find matches from a fixed set of statements, but the user's input can vary
Must work for whole statements, not just single words
I think what I found is not a gensim embedding but a word2vec embedding. Whatever it is.
You need tensorflow_hub
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is text classification or semantic similarity, because you want to find the top 5 or 10 statements closest to the statement given by the user.
It is easy to use, but the model is about 1 GB in size. It works with words, sentences, phrases or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load the model (it downloads the first time it is used).
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# data[0] is the statement from your DB; the rest are user inputs to compare against
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# similarity between data[0] and every statement, via the inner product
dists = np.inner(vecs[0], vecs)
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
First value 0.999999 is distance between display classes and display classes itself. second 0.5633253 is distance between display classes and show and last 0.61701006 is distance between display classes and show types.
Using this, you can find distance between given input and statements in db. then rank them according to distance.
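A minimal sketch of that ranking step, reusing vecs and data from the block above (data[0] is the reference statement, the rest are the candidates; the cut-off is arbitrary):
import numpy as np
# Similarity of data[0] to every other statement, most similar first.
dists = np.inner(vecs[0], vecs[1:])
order = np.argsort(-dists)
top_k = 3  # hypothetical cut-off
for i in order[:top_k]:
    print(data[1:][i], float(dists[i]))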
You can use WordNet to find synonyms and then use those synonyms to find similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')
def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except:
        print("Something Wrong Happened")
    syn_words = []
    for i in syn_list:
        syn_words.append(i.lemmas()[0].name())
    return syn_words
Now split each statement in your DB into words and build the synonym sets, like this:
stat = ["display classes"]
syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now that you have the synonyms, just compare them with the input text. Use a lemmatizer before comparing words so that displayed becomes display; a rough sketch of that comparison follows.
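This is only a sketch under the assumption that syn_dict comes from the block above; the match_statements helper and its overlap scoring are made up for illustration.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
def lemma(word):
    # Verb lemma first ('displayed' -> 'display'), then noun lemma ('classes' -> 'class').
    return lemmatizer.lemmatize(lemmatizer.lemmatize(word, pos='v'), pos='n')
def match_statements(user_text, syn_dict):
    words = {lemma(w) for w in user_text.lower().split()}
    scored = []
    for statement, synonyms in syn_dict.items():
        overlap = words & {lemma(s.lower()) for s in synonyms}
        if overlap:
            scored.append((statement, len(overlap)))
    # Statements sharing the most lemmas with the input come first.
    return sorted(scored, key=lambda x: -x[1])
print(match_statements("displayed class", syn_dict))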
You can also use spaCy.
This answer is based on https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run the following command to download the model:
!python -m spacy download en_core_web_lg
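If you want to rank several user inputs against the stored statement, a small sketch could look like this (the query strings are only examples):
import spacy
nlp = spacy.load("en_core_web_lg")
target = nlp("display classes")
queries = ["show types", "showed", "displayed class", "blue jeans"]
# Score each query against the stored statement and sort, most similar first.
scored = sorted(((q, target.similarity(nlp(q))) for q in queries), key=lambda x: -x[1])
for q, score in scored:
    print(f"{q}: {score:.3f}")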

How to interpret Python NLTK bigram likelihood ratios?

I'm trying to figure out how to properly interpret nltk's "likelihood ratio" given the below code (taken from this question).
import nltk.collocations
import nltk.corpus
import collections
bgm = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(nltk.corpus.brown.words())
scored = finder.score_ngrams(bgm.likelihood_ratio)
# Group bigrams by first word in bigram.
prefix_keys = collections.defaultdict(list)
for key, scores in scored:
    prefix_keys[key[0]].append((key[1], scores))
for key in prefix_keys:
    prefix_keys[key].sort(key=lambda x: -x[1])
prefix_keys['baseball']
With the following output:
[('game', 32.11075451975229),
('cap', 27.81891372457088),
('park', 23.509042621473505),
('games', 23.10503351305401),
("player's", 16.22787286342467),
('rightfully', 16.22787286342467),
[...]
Looking at the docs, it looks like the likelihood ratio printed next to each bigram is from
"Scores ngrams using likelihood ratios as in Manning and Schutze
5.3.4."
Referring to this article, which states on pg. 22:
One advantage of likelihood ratios is that they have a clear intuitive
interpretation. For example, the bigram powerful computers is
e^(.5*82.96) = 1.3*10^18 times more likely under the hypothesis that
computers is more likely to follow powerful than its base rate of
occurrence would suggest. This number is easier to interpret than the
scores of the t test or the χ² test which we have to look up in a
table.
What I'm confused about is what the "base rate of occurrence" would be if I use the nltk code above with my own data. Would it be safe to say, for example, that "game" is 32 times more likely to appear next to "baseball" in the current dataset than in average use of standard English? Or is it that "game" is more likely to appear next to "baseball" than other words appearing next to "baseball" within the same set of data?
Any help/guidance towards a clearer interpretation or example is much appreciated!
NLTK does not have a universal corpus of English usage from which to model the probability of 'game' following 'baseball'.
Using the corpus it does have available, the likelihood ratio is computed from the probability of 'game' given that the preceding word is 'baseball', compared with the base rate at which 'game' occurs in that same corpus.
nltk.corpus.brown
is a built-in corpus, or set of observations, and the predictive power of any probability-based model is entirely defined by the observations used to construct or train it.
nltk.collocations.BigramAssocMeasures().raw_freq
models raw frequency, which (like the t test) is not well suited to sparse data such as bigrams; hence the provision of the likelihood ratio.
The likelihood ratio as calculated by Manning and Schutze is not equivalent to a frequency.
https://nlp.stanford.edu/fsnlp/promo/colloc.pdf
Section 5.3.4 describes in detail how the calculation is done.
The likelihood ratio can be arbitrarily large.
The chart (not reproduced here) may be helpful: the likelihood ratio is its leftmost column.
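If it helps, the Manning and Schutze interpretation quoted in the question applies directly to the scores above: the value returned by score_ngrams(bgm.likelihood_ratio) is the -2 log lambda statistic, so exp(score / 2) gives the factor by which the dependence hypothesis is more likely than the base-rate (independence) hypothesis, within the same corpus rather than English at large. A quick sketch:
import math
# ('baseball', 'game') score taken from the output in the question.
score = 32.11075451975229
factor = math.exp(score / 2)
print(f"'game' after 'baseball' is about {factor:.3g} times more likely "
      "under the dependence hypothesis than its base rate in the Brown corpus would suggest")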

Scikit-Learn Vectorizer `max_features`

How do I choose the number for the max_features parameter of TfidfVectorizer? Should I use the maximum number of elements in the data?
The description of the parameter does not give me a clear idea of how to choose a value for it:
max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
This parameter is entirely optional and should be calibrated based on reasoning about your data.
Sometimes it is not effective to use the whole vocabulary, as the data may contain some exceptionally rare words which, if passed to TfidfVectorizer().fit(), will add unwanted dimensions to future inputs. One appropriate technique in this case would be to print out the word frequencies across documents and set a threshold on them. Imagine you set a threshold of 50 and your corpus contains 100 distinct words; after looking at the frequencies you find that 20 words occur fewer than 50 times, so you set max_features=80 and you are good to go.
If max_features is set to None, the whole vocabulary is used in the TF-IDF transformation. Otherwise, if you pass, say, 5 to max_features, the feature matrix is built from the 5 most frequent words across the text documents.
Quick example
Assume you work with hardware-related documents. Your raw data is the following:
from sklearn.feature_extraction.text import TfidfVectorizer
data = ['gpu processor cpu performance',
'gpu performance ram computer',
'cpu computer ram processor jeans']
You can see that the word jeans in the third document is hardly related to the topic and occurs only once in the whole dataset. The best way to omit the word would of course be to use the stop_words parameter, but imagine there are plenty of such words, or words that are related to the topic but occur rarely. In the second case, the max_features parameter can help. If you proceed with max_features=None, it will create a 3x7 sparse matrix, while the best-case scenario here is a 3x6 matrix:
tf = TfidfVectorizer(max_features=None).fit(data)
len(tf.vocabulary_)     # returns 7, as the corpus has 7 distinct words
tf.fit_transform(data)  # returns a 3x7 sparse matrix
tf = TfidfVectorizer(max_features=6).fit(data)  # excludes 'jeans'
tf.vocabulary_          # contains every word except 'jeans'
len(tf.vocabulary_)     # returns 6
tf.fit_transform(data)  # returns a 3x6 sparse matrix
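To follow the "print the frequencies, then pick a threshold" approach described earlier, a quick sketch on the same toy corpus could look like this (use get_feature_names() instead of get_feature_names_out() on older scikit-learn versions):
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
data = ['gpu processor cpu performance',
        'gpu performance ram computer',
        'cpu computer ram processor jeans']
# Count how often each term occurs across the whole corpus.
cv = CountVectorizer().fit(data)
counts = np.asarray(cv.transform(data).sum(axis=0)).ravel()
print(sorted(zip(cv.get_feature_names_out(), counts), key=lambda kv: -kv[1]))
# 'jeans' occurs only once, so keeping the top 6 terms (max_features=6) drops it.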

Sentiment analysis for sentences- positive, negative and neutral

I'm designing a text classifier in Python using NLTK. One of the features considered for every sentence is its sentiment. I want to weight sentences with either positive or negative sentiment more than those without any sentiment (neutral sentences). Using the movie review corpus along with the naive Bayes classifier results in only positive and negative labels. I tried using demo_liu_hu_lexicon in nltk.sentiment.utils, but the function does not return any values, instead printing them to the output, and it is very slow. Does anyone know of a library which gives some sort of weight to sentences based on sentiment?
Thanks!
Try the textblob module:
from textblob import TextBlob
text = '''
These laptops are horrible but I've seen worse. How about lunch today? The food was okay.
'''
blob = TextBlob(text)
for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# -0.7
# 0.0
# 0.5
It uses the nltk library to determine polarity, which is a float measure ranging from -1 to 1. Neutral sentences have zero polarity. You should be able to get the same measure directly from nltk.
VADER is a rule-based sentiment analysis tool that works well for social media text as well as regular text.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd
analyser = SentimentIntensityAnalyzer()
def print_sentiment_scores(tweets):
    vadersenti = analyser.polarity_scores(tweets)
    return pd.Series([vadersenti['pos'], vadersenti['neg'], vadersenti['neu'], vadersenti['compound']])
text = 'This goes beyond party lines. Separating families betrays our values as Texans, Americans and fellow human beings'
print_sentiment_scores(text)
The results are:
0    0.2470    (pos)
1    0.0000    (neg)
2    0.7530    (neu)
3    0.5067    (compound)
The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence; calling it a 'normalized, weighted composite score' is accurate.
Although positive sentiment is conventionally taken as a compound score >= 0.05, you can always adjust these thresholds to decide what counts as positive, negative or neutral; a short sketch of that follows below.
I personally find VADER very good at figuring out sentiment from emotions, special characters and emojis.
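A minimal sketch of that thresholding, assuming the symmetric 0.05 cut-offs mentioned above (they can be tuned to your data):
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()
def label_sentence(sentence, threshold=0.05):
    # Classify by the compound score: >= threshold positive, <= -threshold negative.
    compound = analyser.polarity_scores(sentence)['compound']
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'
for s in ["These laptops are horrible but I've seen worse.",
          "How about lunch today?",
          "The food was okay."]:
    print(s, '->', label_sentence(s))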
