how tfidf value is used in k-means clustering - python-3.x

I am using K-means clustering with TF-IDF using the scikit-learn library. I understand that K-means uses distance to create clusters and that the distance is computed between points such as (x axis value, y axis value), but the tf-idf is a single numerical value. My question is: how is this tf-idf value converted into an (x, y) value by K-means clustering?

TF-IDF isn't a single value (i.e. a scalar). For every document, it returns a vector in which each value corresponds to one word in the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
sent1 = "the quick brown fox jumps over the lazy brown dog"
sent2 = "mr brown jumps over the lazy fox"
corpus = [sent1, sent2]
# Use the defaults: `input` expects 'content', 'filename' or 'file', not the corpus itself
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.todense())
[out]:
matrix([[0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
0. , 0.25038633, 0.35190925, 0.50077266],
[0.35409974, 0. , 0.35409974, 0.35409974, 0.35409974,
0.49767483, 0.35409974, 0. , 0.35409974]])
It returns a 2-D matrix where the rows represent the sentences and the columns represent the words in the vocabulary.
>>> vectorizer.vocabulary_
{'the': 8,
'quick': 7,
'brown': 0,
'fox': 2,
'jumps': 3,
'over': 6,
'lazy': 4,
'dog': 1,
'mr': 5}
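To read the matrix column by column, the vocabulary mapping can be inverted. The sketch below hard-codes the vocabulary and the first row from the output above, so it runs without scikit-learn; the names `index_to_word` and `weights_sent1` are only illustrative.

```python
# Vocabulary and first TF-IDF row copied from the output above.
vocabulary = {'the': 8, 'quick': 7, 'brown': 0, 'fox': 2, 'jumps': 3,
              'over': 6, 'lazy': 4, 'dog': 1, 'mr': 5}
row_sent1 = [0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
             0.0, 0.25038633, 0.35190925, 0.50077266]

# Invert the word -> column mapping so each column can be labelled.
index_to_word = {idx: word for word, idx in vocabulary.items()}

# Pair every weight in the row with the word its column stands for.
weights_sent1 = {index_to_word[i]: w for i, w in enumerate(row_sent1)}

# Print the words of the first sentence from heaviest to lightest weight.
for word, weight in sorted(weights_sent1.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}: {weight:.4f}")
```

The word 'mr' gets a weight of 0 here because it does not occur in the first sentence.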
So when K-means measures the distance/similarity between two documents, it compares two rows of this matrix, i.e. two vectors with one coordinate per vocabulary word rather than an (x, y) pair. E.g. assuming the similarity is just the dot product between two rows:
import numpy as np
vector1 = X.todense()[0]
vector2 = X.todense()[1]
float(np.dot(vector1, vector2.T))
[out]:
0.7092938737640962
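scikit-learn's KMeans works with Euclidean distance rather than the dot product, but for these rows the two are closely tied: TfidfVectorizer L2-normalizes each row by default, and for unit-length vectors the squared Euclidean distance equals 2 - 2 * (dot product). A small check, using the two rows copied from the printed matrix above (so the values are rounded):

```python
import numpy as np

# TF-IDF rows copied from the printed matrix above (rounded to 8 decimals).
doc1 = np.array([0.50077266, 0.35190925, 0.25038633, 0.25038633, 0.25038633,
                 0.0, 0.25038633, 0.35190925, 0.50077266])
doc2 = np.array([0.35409974, 0.0, 0.35409974, 0.35409974, 0.35409974,
                 0.49767483, 0.35409974, 0.0, 0.35409974])

dot = float(doc1 @ doc2)                      # similarity used in the example above
sq_dist = float(np.sum((doc1 - doc2) ** 2))   # squared Euclidean distance

# For unit-length rows: ||a - b||^2 = 2 - 2 * (a . b)
print(round(dot, 4), round(sq_dist, 4), round(2 - 2 * dot, 4))
```

So a higher dot product between two TF-IDF rows means a smaller Euclidean distance, which is what K-means acts on.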
Chris Potts has a nice tutorial on how vector space models like the TF-IDF one are created: http://web.stanford.edu/class/linguist236/materials/ling236-handout-05-09-vsm.pdf

Related

BERT with WMD distance for sentence similarity

I have tried to calculate the similarity between the two sentences using BERT and word mover distance (WMD). I am unable to find the correct formula for WMD in python. Also tried the WMD python library but it uses the word2vec model for embedding. Kindly help to solve the below problem to get the similarity score using WMD.
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_obama = sentence_obama.lower().split()
sentence_president = sentence_president.lower().split()
#Importing bert for creating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
#creating an embedding of both sentences
sentence_embeddings1 = model.encode(sentence_obama)
sentence_embeddings2 = model.encode(sentence_president)
distance = WMD(sentence_embeddings1, sentence_embeddings2)
print(distance)
Generally speaking, Word Mover Distance (based on Earth Mover Distance) requires a representation in which each feature is associated with a weight (or density). An example is the bag-of-words representation of a sentence as a histogram of words.
Intuitively, EMD measures the cost of moving weights (dirt) in a histogram representation of features, given the ground distance between each pair of features. With words as features, word vectors provide a distance measure between words, and EMD then becomes WMD on word histograms.
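To make the "moving dirt" intuition concrete, here is a minimal sketch of EMD for the simplest possible case: two histograms over the same evenly spaced bins on a line, with a ground distance of 1 between neighbouring bins. This is a toy special case, not WMD itself; word vectors give arbitrary ground distances, which is why a general solver such as pyemd is needed there. The function name `emd_1d` is illustrative.

```python
import numpy as np

def emd_1d(hist_a, hist_b):
    """EMD between two histograms over the same evenly spaced bins.

    Assumes both histograms have the same total mass and that the ground
    distance between neighbouring bins is 1.
    """
    hist_a = np.asarray(hist_a, dtype=float)
    hist_b = np.asarray(hist_b, dtype=float)
    # The running surplus at each bin is the mass that must cross to the next bin.
    surplus = np.cumsum(hist_a - hist_b)
    return float(np.abs(surplus).sum())

a = [0.5, 0.5, 0.0]
b = [0.0, 0.5, 0.5]
print(emd_1d(a, b))  # 1.0
```

Here half of the mass has to travel two bins (or, equivalently, two halves travel one bin each), so the total cost is 1.0.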
There are two issues with using WMD on BERT embeddings:
1. BERT embeddings provide contextual representations of sub-words and of the sentence (the representation of a sub-word changes with its context).
2. There is no measure of density or weight on words and sub-words other than the attention mask on tokens.
The most simple and effective sentence similarity measure with BERT is based on the distance between [CLS] vectors of two sentences (the first vectors in the last hidden layers: the sentence vectors).
With all that said, I will try to find alternative ways to use WMD using pyemd module as in this Gensim implementation of WMD.
To measure which solution actually works, I will evaluate different solutions on this sentence similarity dataset in English.
import datasets
dataset = datasets.load_dataset('stsb_multi_mt', 'en')
Instead of the sentence_transformers module, I use the main Hugging Face transformers library. For simplicity, I will use the following function to get the token and sentence embeddings for a given string:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
def encode(sent):
    inp = tokenizer(sent, return_tensors='pt')
    out = model(**inp)
    # Last-layer hidden states of the single input: one row per token, [CLS] first
    out = out.last_hidden_state[0].detach().numpy()
    return out
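Do not forget to import these modules as well:
import logging
from collections import Counter
import numpy as np
from pyemd import emd
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr
from tqdm import tqdm
logger = logging.getLogger(__name__)  # used inside wmdistance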
We use cdist to measure vector distances, and Spearman's rank-order correlation (spearmanr) to compare our predicted similarity measure with the human judgments.
true_scores = []
pred_cls_scores = []
for item in tqdm(dataset['test']):
    sent1 = encode(item['sentence1'])
    sent2 = encode(item['sentence2'])
    true_scores.append(item['similarity_score'])
    pred_cls_scores.append(cdist(sent1[:1], sent2[:1])[0, 0])
spearmanr(true_scores, pred_cls_scores)
# SpearmanrResult(correlation=-0.737203146420342, pvalue=1.0236865615739037e-236)
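The magnitude of Spearman's rho (0.737) is quite high! The sign is negative because the prediction is a distance, which shrinks as the human similarity score grows.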
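The negative sign in the result above comes from correlating a distance with a similarity score. A small sketch with made-up numbers shows that negating the distances flips the sign and leaves the magnitude unchanged. The helper `spearman_no_ties` is a simplified stand-in for `spearmanr` that assumes there are no tied values.

```python
import numpy as np

def spearman_no_ties(x, y):
    """Spearman's rho as the Pearson correlation of ranks (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

# Hypothetical toy data: a higher human score means more similar sentences.
human_similarity = [4.8, 3.9, 2.5, 1.2, 0.3]
model_distance = [0.9, 2.0, 1.7, 3.1, 4.4]   # a small distance means similar

rho_distance = spearman_no_ties(human_similarity, model_distance)
rho_negated = spearman_no_ties(human_similarity, [-d for d in model_distance])
print(rho_distance, rho_negated)
```

For comparing methods, only the magnitude of rho matters, as long as every method predicts a distance.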
The original post proposes to represent sentences as vectors of words based on white-space tokenization, and to run WMD over that representation. Here is an implementation of WMD based on the EMD module, similar to Gensim's:
def wmdistance(sent1, sent2):
    words1 = sent1.split()
    words2 = sent2.split()
    embs1 = np.array([encode(word)[0] for word in words1])
    embs2 = np.array([encode(word)[0] for word in words2])
    vocab_freq = Counter(words1 + words2)
    vocab_indices = {w:idx for idx, w in enumerate(vocab_freq)}
    sent1_indices = [vocab_indices[w] for w in words1]
    sent2_indices = [vocab_indices[w] for w in words2]
    vocab_len = len(vocab_freq)
    # Compute distance matrix.
    distance_matrix = np.zeros((vocab_len, vocab_len), dtype=np.double)
    distance_matrix[np.ix_(sent1_indices, sent2_indices)] = cdist(embs1, embs2)
    if abs((distance_matrix).sum()) < 1e-8:
        # `emd` gets stuck if the distance matrix contains only zeros.
        logger.info('The distance matrix is all zeros. Aborting (returning inf).')
        return float('inf')
    def nbow(sent):
        d = np.zeros(vocab_len, dtype=np.double)
        nbow = [(vocab_indices[w], vocab_freq[w]) for w in sent]
        doc_len = len(sent)
        for idx, freq in nbow:
            d[idx] = freq / float(doc_len) # Normalized word frequencies.
        return d
    # Compute nBOW representation of documents. This is what pyemd expects on input.
    d1 = nbow(words1)
    d2 = nbow(words2)
    # Compute WMD.
    return emd(d1, d2, distance_matrix)
The Spearman correlation is in the expected direction, but not as strong as with the standard solution above.
pred_wmd_scores = []
for item in tqdm(dataset['test']):
    pred_wmd_scores.append(wmdistance(item['sentence1'], item['sentence2']))
spearmanr(true_scores, pred_wmd_scores)
# SpearmanrResult(correlation=-0.4279390535806689, pvalue=1.6453234927014767e-62)
A magnitude of rho=0.428 may be acceptable for word-vector representations, but it is quite low compared to the [CLS] result.
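The normalized bag-of-words (nBOW) weights that EMD consumes can be illustrated in isolation. This is a minimal sketch that assumes counts are taken per sentence, so the weights of one sentence sum to 1; it returns a dictionary rather than the dense array that pyemd expects.

```python
from collections import Counter

def nbow(words):
    """Normalized bag-of-words: each distinct word's share of the sentence."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

weights = nbow("the president greets the press in chicago".split())
print(weights)
```

In this sentence 'the' appears twice among seven tokens, so it carries a weight of 2/7 and every other word carries 1/7.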
There are also alternative ways to use EMD on [CLS] vectors. In order to run EMD, we need ground distances between the features of the vector. So one alternative is to map the embeddings onto a new vector space in which [CLS] vectors express the weights of more meaningful features. For example, we can take a list of sentence vectors as the components of the vector space, then map each sentence vector onto this component space, so that each sentence is represented by a vector of component weights. The distance between components is measurable in the original embedding space:
def emdistance(embs1, embs2, components):
    # Ground distance: cosine distance between every pair of components
    distance_matrix = cdist(components, components, metric='cosine')
    # Weight of each component: cosine similarity between it and the [CLS] vector
    sent_vec1 = 1-cdist(components, embs1[:1], metric='cosine')[:, 0]
    sent_vec2 = 1-cdist(components, embs2[:1], metric='cosine')[:, 0]
    return emd(sent_vec1, sent_vec2, distance_matrix)
For some applications it may be possible to find defining sentences to use as components. Here I just sample 20 random sentences to test this:
n = 20
indices = np.arange(len(dataset['train']))
np.random.shuffle(indices)
random_sentences = [dataset['train'][int(idx)]['sentence1'] for idx in indices[:n]]
random_components = np.array([encode(sent)[0] for sent in random_sentences])
pred_emd_scores = []
for item in tqdm(dataset['test']):
    sent1 = encode(item['sentence1'])
    sent2 = encode(item['sentence2'])
    pred_emd_scores.append(emdistance(sent1, sent2, random_components))
spearmanr(true_scores, pred_emd_scores)
#SpearmanrResult(correlation=-0.5347151444976767, pvalue=8.092612264709952e-103)
Even with 20 random sentences as components, a magnitude of rho=0.535 is a better score than the bag-of-words result of rho=0.428.

Perform matrix multiplication with cosine similarity function

I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above so that the first sub-list in list_1 is compared against every sub-list of list_2, then the second sub-list in list_1 against every sub-list of list_2, and so on.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
1. Is there a way to do some sort of element-wise matrix multiplication, but with gensim's n_similarity() method, to generate the required matrix?
2. Would the current method or matrix multiplication be more efficient and faster?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two lists different lengths to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
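The magnitude and divisor steps above can also be written more compactly. The sketch below uses two small made-up count matrices (not the ones from the example) and computes the row norms with `np.linalg.norm(..., axis=1)` and the divisor matrix with `np.outer`, so no slicing of an n x n matrix is needed.

```python
import numpy as np

# Hypothetical small count matrices (rows = documents, columns = vocabulary).
counts_1 = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 1]])
counts_2 = np.array([[1, 0, 0, 1],
                     [0, 0, 1, 0],
                     [1, 1, 1, 1]])

dot_products = counts_1 @ counts_2.T
norms_1 = np.linalg.norm(counts_1, axis=1)   # one norm per row, no Python loop
norms_2 = np.linalg.norm(counts_2, axis=1)

# np.outer gives the len(counts_1) x len(counts_2) divisor matrix directly.
cos_sim = dot_products / np.outer(norms_1, norms_2)
print(np.round(cos_sim, 3))
```

Note that a row consisting only of zeros would produce a division by zero, so such rows need to be handled separately.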
There are two problems in the code, in the second-to-last and last lines.
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a function that directly calculates the similarity between two sentences (L1 and L2): they are first converted to two vectors, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't do any kind of matrix multiplication with it.
If you want to do your own matrix multiplication, then instead of using n_similarity() directly, calculate the vectors of the sentences yourself and apply matrix multiplication when calculating the cosine similarity.
2. As I said in (1), everything is done in n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
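As a sketch of the "calculate the vectors yourself" route: n_similarity() takes the mean vector of each word list and returns the cosine between the two means, so all pairs can be obtained from one matrix product of unit-length mean vectors. The code below uses random toy vectors in place of the word2vec model (an assumption made so that it runs without the GoogleNews file); with a real KeyedVectors model, `toy_vectors[w]` would become `model[w]`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the word2vec model: random 8-dimensional vectors.
list_1 = [['flavor', 'taste'], ['scent', 'aroma', 'smell']]
list_2 = [['love', 'spicy', 'noodles'], ['buy', 'perfumes'], ['love', 'wife']]
vocab = {w for sub in list_1 + list_2 for w in sub}
toy_vectors = {w: rng.normal(size=8) for w in sorted(vocab)}

def unit_mean_vectors(word_lists):
    """One row per sub-list: the mean word vector, scaled to unit length."""
    means = np.array([np.mean([toy_vectors[w] for w in words], axis=0)
                      for words in word_lists])
    return means / np.linalg.norm(means, axis=1, keepdims=True)

# A single matrix product yields all len(list_2) x len(list_1) cosine scores.
similarity_mat = unit_mean_vectors(list_2) @ unit_mean_vectors(list_1).T
print(similarity_mat.shape)  # (3, 2)
```

Building the mean vectors still loops over the sub-lists, but the pairwise comparison, which grows with len(list_1) * len(list_2), is handled by the matrix product.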

keras pre-processing of text using one_hot class

I came across this code while learning keras online.
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
one_hot(text, length)
This returns integers like this...
[3, 1, 1, 2, 3]
I do not understand why and how unique words return duplicate numbers. For example, 3 and 1 are repeated even though the words in the text are unique.
From the documentation of one_hot it is described how it is a wrapper of hashing_trick:
This is a wrapper to the hashing_trick function using hash as the hashing function; unicity of word to index mapping non-guaranteed.
From the documentation of hashing_trick:
Two or more words may be assigned to the same index, due to possible collisions by the hashing function. The probability of a collision is in relation to the dimension of the hashing space and the number of distinct objects.
Since hashing is used, there is a chance that different words will be hashed to the same index. The probability of such a collision depends on the size of the hashing space relative to the number of distinct words: the larger the space, the less likely a collision.
Jason Brownlee suggests using a vocabulary size 25% larger than the number of words to reduce the chance of collisions.
Following Jason Brownlee's suggestion in your case results in:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from tensorflow.random import set_random_seed
import math
set_random_seed(1)
text = 'One hot encoding in Keras'
tokens = text_to_word_sequence(text)
length = len(tokens)
print(one_hot(text, math.ceil(length*1.25)))
which returns the integers
[3, 4, 5, 1, 6]
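The collisions described above can be reproduced without Keras. The sketch below is only an illustration of the hashing idea, not necessarily the exact Keras implementation: it maps each word to an index in 1..n-1 with a stable MD5 hash, which matches the range of the outputs shown above. With n=5 there are only four possible indices for five words, so at least one collision is unavoidable.

```python
import hashlib

def hashed_index(word, n):
    """Map a word to an integer in [1, n - 1] with a stable MD5 hash.

    Illustrative only: this mimics the idea of the hashing trick, not
    necessarily the exact Keras implementation.
    """
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

words = ["one", "hot", "encoding", "in", "keras"]

# Try a hashing space equal to the word count, 25% larger, and much larger.
for n in (5, 7, 50):
    indices = [hashed_index(w, n) for w in words]
    collisions = len(words) - len(set(indices))
    print(f"n={n:>2}: {indices} -> {collisions} collision(s)")
```

A larger n makes collisions less likely but never rules them out; a guaranteed one-to-one mapping needs an explicit vocabulary rather than a hash.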

Using Kmeans to cluster small phrases in Spark

I have a list of words/phrases (around a million) that I would like to cluster. Assume that it is the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance (let's say the Jaro-Winkler metric), I compute the distances between all the words in the list, which creates a 5x5 matrix with ones on the diagonal, since each word is also compared with itself. To compute all the distances, I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distances computations:
import jaro
from numpy import array
def ComputeStringDistance(phrase,phrase_list_broadcasted):
    keyvalueDistances = []
    for value in phrase_list_broadcasted:
        distanceValue = jaro.jaro_winkler_metric(phrase,value)
        keyvalueDistances.append(distanceValue)
    return (array(keyvalueDistances))
string_distances = (a_list_rdd
.map(lambda phrase:ComputeStringDistance(phrase,a_list_rdd_broadcasted.value))
)
and using K means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel
clusters = KMeans.train(string_distances, 3 , maxIterations=10,
runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point:clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
Out[73]:
[(0, u'java'),
(0, u'javascript'),
(2, u'python'),
(2, u'pyspark'),
(1, u'c ++')]
Not bad! But what happens if I have 1 million observations and an estimated 10000 clusters? From what I have read in some posts, a large number of clusters is really expensive. Is there a way to overcome this issue?
k-means does not operate on a distance matrix (distance matrices also do not scale).
K-means also does not work with arbitrary distance functions.
It's about minimizing variance, the sum-of-squared-deviations-from-the-mean.
What you are doing works because it's halfway to spectral clustering, but it's neither k-means used correctly, nor spectral clustering.
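The variance objective can be illustrated with a few made-up points. K-means represents each cluster by its mean because the mean is the point that minimizes the sum of squared Euclidean deviations; that guarantee does not carry over to an arbitrary distance such as a string metric. A minimal sketch (the helper name `sum_sq_dev` is illustrative):

```python
import numpy as np

def sum_sq_dev(points, center):
    """Sum of squared Euclidean distances from each point to `center`."""
    return float(np.sum((points - center) ** 2))

# Hypothetical 2-D cluster.
cluster = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]])
mean = cluster.mean(axis=0)

print(sum_sq_dev(cluster, mean))         # cost when the centre is the mean
print(sum_sq_dev(cluster, cluster[0]))   # cost when the centre is a data point
```

Any centre other than the mean gives a larger cost for this cluster, which is the quantity each k-means iteration reduces.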

Does TfidfVectorizer keep order of the features?

I wonder if TfidfVectorizer keeps the order of the features when transforming documents using scikit-learn. Here is what I am doing:
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['this movie is cool', 'I love this book']
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)
joblib.dump(vec, './vec')
doc = 'What are the coolest movies in 2015'
vec = joblib.load('./vec')
X_test = vec.transform([doc])
Now, my question is: are the feature entries in X and X_test in the same order?
Yes. When you call fit(), it creates a vocabulary dictionary that maps text strings to column indexes, and it uses that same dictionary to transform additional data sets. The dictionary is preserved through any serialization and deserialization.
vec.vocabulary_
> {u'book': 0, u'cool': 1, u'is': 2, u'love': 3, u'movie': 4, u'this': 5}
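The mechanism can be sketched without scikit-learn. The two functions below are hypothetical, simplified stand-ins that use raw counts instead of TF-IDF weights: fitting assigns every token a fixed column in sorted order, and transforming only ever writes into those columns, so the column order cannot change between X and X_test.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def fit_vocabulary(corpus):
    """Assign each distinct lowercase token a column index in sorted order."""
    tokens = {tok for doc in corpus for tok in TOKEN.findall(doc.lower())}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

def transform_counts(doc, vocabulary):
    """Count tokens into the fixed columns; unseen tokens are ignored."""
    row = [0] * len(vocabulary)
    for tok in TOKEN.findall(doc.lower()):
        if tok in vocabulary:
            row[vocabulary[tok]] += 1
    return row

vocabulary = fit_vocabulary(['this movie is cool', 'I love this book'])
print(vocabulary)
print(transform_counts('this book is cool', vocabulary))
print(transform_counts('What are the coolest movies in 2015', vocabulary))
```

In this sketch the test document maps to an all-zero row, because 'coolest' and 'movies' are different tokens from 'cool' and 'movie' and are therefore not in the fitted vocabulary.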

Resources