probability distribution of topics using NMF - scikit-learn

I use the following code to do the topic modeling on my documents:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize, max_df=0.85, min_df=3, ngram_range=(1,5))
tfidf = tfidf_vectorizer.fit_transform(docs)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
from sklearn.decomposition import NMF
no_topics = 50
%time nmf = NMF(n_components=no_topics, random_state=11, init='nndsvd').fit(tfidf)
topic_pr= nmf.transform(tfidf)
I thought topic_pr gives me the probability distribution of different topics for each document. In other words, I expected the numbers in the output (topic_pr) to be the probabilities that the document in row X belongs to each of the 50 topics in the model. But the numbers do not add up to 1. Are these really probabilities? If not, is there a way to convert them to probabilities?
Thanks

NMF returns a non-negative factorization and doesn't have anything to do with probabilities (to the best of my knowledge). If you just want probabilities, you can transform the output of NMF with an L1 normalization:
probs = topic_pr / topic_pr.sum(axis=1, keepdims=True)
This assumes that topic_pr is a non-negative matrix, which is true in your case.
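If a document ends up with an all-zero row in topic_pr, the division above produces NaNs; a minimal guard, assuming topic_pr is a NumPy array:
import numpy as np
row_sums = topic_pr.sum(axis=1, keepdims=True)
# leave all-zero rows as zeros instead of dividing by zero
probs = np.divide(topic_pr, row_sums, out=np.zeros_like(topic_pr), where=row_sums != 0)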
EDIT: Apparently there is a probabilistic version of NMF.
Quoting sklearn's documentation:
Non-negative Matrix Factorization is applied with two different objective functions: the Frobenius norm, and the generalized Kullback-Leibler divergence. The latter is equivalent to Probabilistic Latent Semantic Indexing.
To apply the latter, which is what you seem to need, from the same link:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5)
topic_pr = lda.fit_transform(tfidf)
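Note that the generalized Kullback-Leibler objective mentioned in the quote can also be used with NMF itself (a sketch; beta_loss='kullback-leibler' requires the multiplicative-update solver):
from sklearn.decomposition import NMF
nmf_kl = NMF(n_components=no_topics, beta_loss='kullback-leibler',
             solver='mu', max_iter=1000, random_state=11)
topic_pr_kl = nmf_kl.fit_transform(tfidf)
# the rows can again be L1-normalized if a probability-like distribution is wanted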

Related

BERT with WMD distance for sentence similarity

I have tried to calculate the similarity between two sentences using BERT and Word Mover's Distance (WMD). I am unable to find the correct formula for WMD in Python. I also tried the WMD Python library, but it uses a word2vec model for the embeddings. Kindly help me solve the problem below to get a similarity score using WMD.
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_obama = sentence_obama.lower().split()
sentence_president = sentence_president.lower().split()
#Importing bert for creating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
#creating an embedding of both sentences
sentence_embeddings1 = model.encode(sentence_obama)
sentence_embeddings2 = model.encode(sentence_president)
distance = WMD(sentence_embeddings1, sentence_embeddings2)
print(distance)
Generally speaking, Word Mover's Distance (based on Earth Mover's Distance) requires a representation in which each feature is associated with a weight (or density), for example a bag-of-words representation of sentences with a histogram of word counts.
Intuitively, EMD measures the cost of moving weights (dirt) in a histogram representation of features, given the ground distance between each pair of features. With words as features, word vectors provide a distance measure between words, and EMD over word histograms then becomes WMD.
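A toy example of EMD with pyemd (a sketch with made-up histograms and ground distances, only to illustrate the inputs it expects):
import numpy as np
from pyemd import emd
# two 3-bin histograms (float64, non-negative)
first = np.array([0.5, 0.5, 0.0])
second = np.array([0.0, 0.5, 0.5])
# ground distance between bins
distance_matrix = np.array([[0.0, 1.0, 2.0],
                            [1.0, 0.0, 1.0],
                            [2.0, 1.0, 0.0]])
print(emd(first, second, distance_matrix))  # 1.0: move 0.5 of mass from bin 0 to bin 2 at cost 2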
There are two issues with using WMD on BERT embeddings:
BERT embeddings provide contextual representations of sub-words and the sentence (the representation of a subword changes in different contexts).
There is no measure of density or weight on words and sub-words other than the attention mask on tokens.
The simplest and most effective sentence similarity measure with BERT is based on the distance between the [CLS] vectors of the two sentences (the first vector in the last hidden layer: the sentence vector).
With all that said, I will try to find alternative ways to use WMD, using the pyemd module as in this Gensim implementation of WMD.
To measure which solution actually works, I will evaluate the different solutions on this sentence similarity dataset in English.
import datasets
dataset = datasets.load_dataset('stsb_multi_mt', 'en')
Instead of the sentence_transformers module, I use the main Hugging Face transformers library. For simplicity, I will use the following function to get the token and sentence embeddings for a given string:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
def encode(sent):
    inp = tokenizer(sent, return_tensors='pt')
    out = model(**inp)
    out = out.last_hidden_state[0].detach().numpy()
    return out
Do not forget to import these modules as well (Counter and tqdm are used in the snippets below):
import numpy as np
from collections import Counter
from tqdm import tqdm
from pyemd import emd
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr
We use cdist to measure vector distances, and Spearman's rank-order correlation (spearmanr) to compare our predicted similarity measure with the human judgments.
true_scores = []
pred_cls_scores = []
for item in tqdm(dataset['test']):
    sent1 = encode(item['sentence1'])
    sent2 = encode(item['sentence2'])
    true_scores.append(item['similarity_score'])
    pred_cls_scores.append(cdist(sent1[:1], sent2[:1])[0, 0])
spearmanr(true_scores, pred_cls_scores)
# SpearmanrResult(correlation=-0.737203146420342, pvalue=1.0236865615739037e-236)
Spearman's rho = 0.737 (in absolute value) is quite high! The sign is negative only because we correlate a distance with a similarity score.
The original post proposes to represent sentences with vectors of words based on whitespace tokenization and to run WMD over that representation. Here is an implementation of WMD based on the pyemd module, similar to Gensim's:
def wmdistance(sent1, sent2):
    words1 = sent1.split()
    words2 = sent2.split()
    # encode each word separately and take its first ([CLS]) vector
    embs1 = np.array([encode(word)[0] for word in words1])
    embs2 = np.array([encode(word)[0] for word in words2])
    vocab_freq = Counter(words1 + words2)
    vocab_indices = {w: idx for idx, w in enumerate(vocab_freq)}
    sent1_indices = [vocab_indices[w] for w in words1]
    sent2_indices = [vocab_indices[w] for w in words2]
    vocab_len = len(vocab_freq)
    # Compute distance matrix.
    distance_matrix = np.zeros((vocab_len, vocab_len), dtype=np.double)
    distance_matrix[np.ix_(sent1_indices, sent2_indices)] = cdist(embs1, embs2)
    if abs(distance_matrix.sum()) < 1e-8:
        # `emd` gets stuck if the distance matrix contains only zeros.
        print('The distance matrix is all zeros. Aborting (returning inf).')
        return float('inf')
    def nbow(sent):
        d = np.zeros(vocab_len, dtype=np.double)
        nbow = [(vocab_indices[w], vocab_freq[w]) for w in sent]
        doc_len = len(sent)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len)  # Normalized word frequencies.
        return d
    # Compute nBOW representation of documents. This is what pyemd expects on input.
    d1 = nbow(words1)
    d2 = nbow(words2)
    # Compute WMD.
    return emd(d1, d2, distance_matrix)
The Spearman correlation is still meaningful (again negative only because a distance is compared with a similarity score), but it is not as high as for the standard solution above.
pred_wmd_scores = []
for item in tqdm(dataset['test']):
    pred_wmd_scores.append(wmdistance(item['sentence1'], item['sentence2']))
spearmanr(true_scores, pred_wmd_scores)
# SpearmanrResult(correlation=-0.4279390535806689, pvalue=1.6453234927014767e-62)
Perhaps rho = 0.428 is not unreasonable for word-vector representations, but it is clearly lower than the [CLS] baseline.
There are also other ways to use EMD on [CLS] vectors. To run EMD, we need ground distances between the features of the vectors. So one alternative is to map the embeddings onto a new vector space in which the [CLS] vectors express weights over more meaningful features. For example, we can create a list of sentence vectors as the components of that space, map each sentence vector onto the component space so that each sentence is represented by a vector of component weights, and measure the distance between components in the original embedding space:
def emdistance(embs1, embs2, components):
    distance_matrix = cdist(components, components, metric='cosine')
    sent_vec1 = 1 - cdist(components, embs1[:1], metric='cosine')[:, 0]
    sent_vec2 = 1 - cdist(components, embs2[:1], metric='cosine')[:, 0]
    return emd(sent_vec1, sent_vec2, distance_matrix)
For some applications it may be possible to find defining sentences to use as components; here I just sample 20 random sentences to test this:
n = 20
indices = np.arange(len(dataset['train']))
np.random.shuffle(indices)
random_sentences = [dataset['train'][int(idx)]['sentence1'] for idx in indices[:n]]
random_components = np.array([encode(sent)[0] for sent in random_sentences])
pred_emd_scores = []
for item in tqdm(dataset['test']):
    sent1 = encode(item['sentence1'])
    sent2 = encode(item['sentence2'])
    pred_emd_scores.append(emdistance(sent1, sent2, random_components))
spearmanr(true_scores, pred_emd_scores)
# SpearmanrResult(correlation=-0.5347151444976767, pvalue=8.092612264709952e-103)
Even with 20 random sentences as components, rho = 0.534 is a better score than the bag-of-words rho = 0.428.

Sklearn TruncatedSVD not showing explained variance ratio in descending order, or does the first number mean something else?

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ratio_)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
Is there a bug in the TruncatedSVD implementation? Or why is the first explained variance value (0.02...) behaving like this? What does it mean?
Summary:
That is because TruncatedSVD and PCA use different SVD functions!
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The SVD solver used by each algorithm is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of the scikit-learn implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Reason 2: TruncatedSVD operates differently from PCA:
In your case you used the randomized solver (the default) in both algorithms, yet you obtained different results with regard to the order of the explained variances.
That is because in PCA the variance is obtained from the actual singular values (called Sigma or S in the scikit-learn implementation), which are already sorted.
On the other hand, the variance in TruncatedSVD is obtained from X_transformed, which results from multiplying the data matrix by the components. The latter does not necessarily preserve order because the data are not centered; nor is sorting the purpose of TruncatedSVD, which exists in the first place for sparse matrices that cannot be centered without densifying them.
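Roughly, the relevant computation in TruncatedSVD looks like this (a paraphrased sketch assuming a dense X and X_transformed = U * Sigma from the solver, not a verbatim copy of the source):
import numpy as np
# X_transformed = U * Sigma, i.e. the data projected onto the components
explained_variance_ = np.var(X_transformed, axis=0)
full_var = np.var(X, axis=0).sum()
explained_variance_ratio_ = explained_variance_ / full_var
# the column variances of X_transformed match the (sorted) squared singular values
# only when X is centered, so without centering the ratios need not be in descending order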
Now if you center your data, you will get them sorted. (Note that your original snippet did not center the data properly: X.mean() is a single global mean, whereas centering subtracts the per-feature mean, e.g. X - X.mean(axis=0). The snippet below goes a step further and standardizes the features with StandardScaler, i.e. centers and scales them.)
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]

PySpark: Get threshold (cutoff) values for each point in ROC curve

I'm starting with PySpark, building binary classification models (logistic regression), and I need to find the optimal threshold (cutoff) point for my models.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point of the curve. Is there a way to find these values?
Things I've found:
This post shows how to extract the ROC curve, but only the values for the TPR and FPR. That's useful for plotting and for selecting the optimal point, but I can't find the threshold values.
I know I can find the threshold values for each point on the ROC curve using H2O (I've done it before), but I'm working in PySpark.
Here is a post describing how to do it with R... but, again, I need to do it with PySpark.
Other facts
I'm using Apache Spark 2.4.0.
I'm working with Data Frames (I really don't know - yet - how to work with RDDs, but I'm not afraid to learn ;) )
If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).
Alternatively, you can use BinaryClassificationMetrics to extract a curve plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately it appears the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics
# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns as a list (false positive rate, true positive rate)
points = CurveMetrics(preds).get_curve('roc')
plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
Running this produces an ROC curve plot (figure omitted here). If you aren't married to ROC, the same wrapper can also produce, for example, an F1 score by threshold curve.
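A sketch of that, assuming the underlying Scala BinaryClassificationMetrics exposes fMeasureByThreshold() returning (threshold, F-measure) pairs:
points = CurveMetrics(preds).get_curve('fMeasureByThreshold')
plt.figure()
x_val = [x[0] for x in points]   # threshold
y_val = [x[1] for x in points]   # F1 score at that threshold
plt.title('F1 score by threshold')
plt.xlabel('Threshold')
plt.ylabel('F1 score')
plt.plot(x_val, y_val)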
One way is to use sklearn.metrics.roc_curve.
First use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels [1]:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds to work with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
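Once you have fpr, tpr and thresholds, one common (though not the only) way to pick a cutoff is to maximize Youden's J statistic (TPR - FPR); a minimal sketch:
import numpy as np
j_scores = tpr - fpr                    # Youden's J for each candidate threshold
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds[best_idx]
print(best_threshold, tpr[best_idx], fpr[best_idx])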
Notes:
[1] I am not 100% certain that the probabilities vector will always be ordered such that the positive label is at index 1. However, in a binary classification problem you'll know right away if your AUC is less than 0.5. In that case, just take 1 - p for the probabilities (since the class probabilities sum to 1).
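If you do hit that case, a quick check and fix might look like this (a sketch using sklearn.metrics.roc_auc_score):
from sklearn.metrics import roc_auc_score, roc_curve
if roc_auc_score(y_true, y_score) < 0.5:
    # the positive-class probability was apparently at index 0, so flip it
    y_score = [1 - p for p in y_score]
    fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)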

Cosine similarity score in scikit-learn is the same for two different vectorization techniques

I am currently working on an assignment where the task is to use the 20_newsgroups dataset and three different vectorization techniques (bag of words, TF, TFIDF) to represent documents in vector format, and then to analyze the difference between the average cosine similarity between each class in the 20_newsgroups dataset. Here is what I am trying to do in Python. I read the data and pass it to sklearn.feature_extraction.text.CountVectorizer (fit() and transform()) for the bag-of-words technique and to TfidfVectorizer for the TFIDF technique.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
import numpy
import math
import csv
===============================================================================================================================================
categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware', 'comp.windows.x','misc.forsale','rec.autos','rec.motorcycles','rec.sport.baseball','rec.sport.hockey',
'sci.crypt','sci.electronics','sci.med','sci.space','soc.religion.christian','talk.politics.guns',
'talk.politics.mideast','talk.politics.misc','talk.religion.misc']
twenty_newsgroup = fetch_20newsgroups(subset='all',remove=('headers', 'footers', 'quotes'),shuffle=True, random_state=42)
dataset_groups = []
for group in range(0,20):
    category = []
    category.append(categories[group])
    dataset_groups.append(fetch_20newsgroups(subset='all',remove=('headers','footers','quotes'),shuffle=True,random_state=42,categories=category))
===============================================================================================================================================
bag_of_word_vect = CountVectorizer(stop_words='english',analyzer='word') #,min_df = 0.09
bag_of_word_vect = bag_of_word_vect.fit(twenty_newsgroup.data,twenty_newsgroup.target)
datamatrix_bow_groups = []
for group in dataset_groups:
    datamatrix_bow_groups.append(bag_of_word_vect.transform(group.data))
similarity_matrix = []
for i in range(0,20):
    means = []
    for j in range(i,20):
        result_of_group_ij = cosine_similarity(datamatrix_bow_groups[i], datamatrix_bow_groups[j])
        means.append(numpy.mean(result_of_group_ij))
    similarity_matrix.append(means)
===============================================================================================================================================
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
tf_vectorizer = tf_vectorizer.fit(twenty_newsgroup.data)
datamatrix_tf_groups = []
for group in dataset_groups:
    datamatrix_tf_groups.append(tf_vectorizer.transform(group.data))
similarity_matrix = []
for i in range(0,20):
    means = []
    for j in range(i,20):
        result_of_group_ij = cosine_similarity(datamatrix_tf_groups[i], datamatrix_tf_groups[j])
        means.append(numpy.mean(result_of_group_ij))
    similarity_matrix.append(means)
Both should technically give a different similarity_matrix, but they are yielding the same one. More precisely, tf_vectorizer should create a similarity_matrix with values closer to 1.
The problem here is that the vectors created by the two techniques for the same document of the same class (for example alt.atheism) are different, as they should be. But when I calculate the similarity score between documents of one class and another class, the cosine similarity scorer gives me the same value. If I understand the theory correctly, TFIDF represents a document in a finer sense in vector space, so the cosine values should be closer to 1 than what I get from the bag-of-words technique, right? But it gives the same similarity score. I tried printing the values of the matrices created by the BOW and TFIDF techniques. It would be a great help if somebody could give me a good reason why this happens, or a strong argument in support of what is happening.
I am new to this platform so please ignore any mistakes and let me know if you need more info.
Thanks & Regards,
Darshan Sonagara
The problem is this line in your code.
tf_vectorizer = TfidfVectorizer(stop_words='english',analyzer='word',use_idf=False) #,sublinear_tf=True
You have set use_idf to False. This means the inverse document frequency is not calculated, so only the term frequency is used. With the default settings, the TfidfVectorizer then just produces L2-normalized term-frequency vectors, i.e. you are essentially using it like a CountVectorizer. Since cosine similarity is invariant to the length of the vectors, the normalization makes no difference, and both approaches yield the same cosine similarities.
Using tf_vectorizer = TfidfVectorizer(stop_words='english', analyzer='word', use_idf=True) will result in a cosine similarity matrix for TFIDF that differs from the CountVectorizer one.
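A toy illustration of this point (a sketch with a made-up three-document corpus):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "the dog sat on the mat",
        "cats and dogs play outside"]
bow = CountVectorizer().fit_transform(docs)
tf_only = TfidfVectorizer(use_idf=False).fit_transform(docs)   # just L2-normalized term frequencies
tfidf = TfidfVectorizer(use_idf=True).fit_transform(docs)
print(cosine_similarity(bow))       # numerically the same as the tf_only similarities
print(cosine_similarity(tf_only))
print(cosine_similarity(tfidf))     # different once IDF weighting is applied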

Role of class_weight in loss functions for LinearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in the case of svm.SVC, svm.LinearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier clf_c. The logistic loss should be (am I correct?):
def logistic_loss(x, y, w, b, b0):
    '''
    x: nxp data matrix where n is the number of data points and p is the number of features.
    y: nx1 vector of true labels (-1 or 1).
    w: nx1 vector of sample weights (vector of 1./n for balanced data).
    b: px1 vector of feature weights.
    b0: intercept.
    '''
    s = y
    if 0 in np.unique(y):
        print('yes')
        s = 2. * y - 1  # map labels from (0, 1) to (-1, 1)
    l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
    return l
I realized that LogisticRegression has predict_log_proba(), which gives you exactly that when the data are balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y)) / len(y)
-clf_c.predict_log_proba(x)[np.arange(len(x)), np.floor((y + 1) / 2).astype(np.int8)].mean() == logistic_loss(x, y, w, b, b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when the data are imbalanced.
What's more, you would expect the classifier (here, LogisticRegression) to behave similarly (in terms of the loss function value) when the data are balanced with class_weight=None versus when the data are imbalanced with class_weight='auto'. I need a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
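A quick numeric check of that equivalence (a sketch with made-up class counts):
import numpy as np
y = np.array([-1] * 90 + [1] * 10)
w_first = {-1: (y == 1).sum() / (y == -1).sum(), 1: 1.0}
w_second = {-1: 1.0 / (y == -1).sum(), 1: 1.0 / (y == 1).sum()}
# both assign the same relative weight to class -1 versus class +1
print(w_first[-1] / w_first[1])    # 0.111...
print(w_second[-1] / w_second[1])  # 0.111...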
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal to 1 divided by the number of times that class appears in your data (y), so classes that appear more often will get lower weights. This is then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array whose element i is the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and LinearSVC the docstring is pretty clear:
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the SVM to classify it properly.
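For example (a sketch with made-up weights and data):
from sklearn.datasets import make_classification
from sklearn.svm import SVC
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
# the effective penalty for class 1 becomes class_weight[1] * C = 10.0,
# so misclassifying the rare class is penalized ten times more heavily
clf = SVC(C=1.0, class_weight={0: 1.0, 1: 10.0}).fit(X, y)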
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not directly influence methods such as predict_proba. They change its output because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the earlier snippet first for the imports and variable definitions):
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.
