Computing cosine similarity using Python - text

I have written the following code to compute the cosine similarity between a number of preprocessed documents (stop-word removal, stemming, and term frequency-inverse document frequency weighting).
print(X.shape)
similarity = []
for each in X:
    similarity.append(cosine_similarity(X[i:1], X))
    print(cosine_similarity(X[i:1], X))
    i = i+1
However, when I run it I receive this:
(2235, 7791)
[[ 1. 0.01490594 0.11752643 ..., 0.00941571 0.03652551
0.01239277]]
Traceback (most recent call last):
File "...", line 83, in <module>
similarity.append(cosine_similarity(X[i:1], X))
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 881, in cosine_similarity
X, Y = check_pairwise_arrays(X, Y)
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 96, in check_pairwise_arrays
X = check_array(X, accept_sparse='csr', dtype=dtype)
File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/utils/validation.py", line 407, in check_array
context))
ValueError: Found array with 0 sample(s) (shape=(0, 7791)) while a minimum of 1 is required.
[Finished in 56.466s]

It's not clear what you're trying to achieve. You're taking a cosine similarity between a slice of the matrix X and the entire matrix. The slice is empty except when i == 0. Your for statement iterates through the matrix, but you never use the iteration variable each.
Cosine similarity is an operation between two vectors of equal length. For instance, you can compute the similarity between row i and row j with
cosine_similarity(X[i], X[j])
If you want all of the row-to-row similarities computed in a list, use a list comprehension:
similarity = [cosine_similarity(a, b) for a in X for b in X]
Does that get you moving?
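If the goal is every row-to-row similarity, a single call is usually enough. The sketch below uses a tiny made-up dense matrix (demo_X is a stand-in for your TF-IDF matrix, not your data). It also shows the one-row case: the slice X[i:i + 1] keeps the row two-dimensional, whereas X[i:1] is empty for every i >= 1, which is what triggers the "0 sample(s)" error.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up stand-in for the TF-IDF matrix: 3 documents, 3 terms.
demo_X = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0],
                   [2.0, 0.0, 2.0]])

# With a single argument, every row is compared with every other row.
all_pairs = cosine_similarity(demo_X)            # shape (3, 3)

# One row against all rows: slice with i:i + 1 so the row stays 2-D.
i = 1
row_vs_all = cosine_similarity(demo_X[i:i + 1], demo_X)   # shape (1, 3)

print(all_pairs)
print(row_vs_all)
```

For a 2235-document matrix this returns one 2235 x 2235 array, which is generally much faster than looping over pairs in Python.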

Related

Python: Function Error TypeError: tuple indices must be integers or slices, not tuple & TypeError: 'int' object is not subscriptable

I have been calling the function below, but I keep getting an error. Can you explain where I am going wrong?
clusters = np.zeros((len(dataset),1))
def assign(centroids,dataset,clusters,k):
    numOfObject=len(dataset)
    #for every object in the dataset
    for i in range(numOfObject):
        X=dataset[i,1:-1]
        #find the closest centroid
        centroidOfX= -1
        distanceToClosestcentroids = np.Inf
        for y in range(k):
            currentcentroids=centroids[y,:]
            dist=distance(X,currentcentroids)
            if dist<distanceToClosestcentroids:
                #Found closer Centroid
                distanceToClosestcentroids= dist
                centroidOfX=y
        #assign to X its closest centroid
        clusters[i]=int(centroidOfX)
#assign((2.5),dataset,clusters,20)
assign((2,1),dataset,clusters,20)
I don't really know why I am getting these errors:
Traceback (most recent call last):
File "c:\library\K-Mean.py", line 71, in <module>
assign((2.5),dataset,clusters,20)
File "c:\library\K-Mean.py", line 62, in assign
currentcentroids=centroids[y,:]
TypeError: 'float' object is not subscriptable
PS C:\library> & "C:/Users/ASHISH SHARMA/AppData/Local/Microsoft/WindowsApps/python3.9.exe" c:/library/K-Mean.py
Traceback (most recent call last):
File "c:\library\K-Mean.py", line 71, in <module>
assign((2,1),dataset,clusters,20)
File "c:\library\K-Mean.py", line 62, in assign
currentcentroids=centroids[y,:]
TypeError: tuple indices must be integers or slices, not tuple
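The function itself worked; the problem was the values I passed in, which were wrong and not in the right order. centroids has to be a 2-D NumPy array so that centroids[y, :] can select a row, and neither the float (2.5) nor the tuple (2, 1) supports that kind of indexing.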
def assign(centroids, dataset, clusters, k):
    """Assign each object in the dataset to its closest centroid (used by K-Means and K-Medians).
    Args:
        centroids (NDArray(Float64)): Centroid coordinates, one row per centroid
        dataset (NDArray): Dataset which we are using in the program to calculate K-Means and K-Medians
        clusters (NDArray(Float64)): Output array; clusters[i] is set to the index of the centroid closest to object i
        k (int): The number of centroids to compare each object against
    """
    numOfObjects = len(dataset)
    #for every object in the dataset
    for i in range(numOfObjects):
        X = dataset[i, 1:-1]
        #find the closest centroid
        centroidsOfX = -1
        distanceToClosestcentroids = np.inf  # np.Inf was removed in NumPy 2.0
        for y in range(k):
            currentcentroids = centroids[y,:]
            dist = distance(X, currentcentroids)
            if dist < distanceToClosestcentroids:
                #Found a closer centroid
                distanceToClosestcentroids = dist
                centroidsOfX = y
        #Assign to X its closest centroid
        clusters[i] = int(centroidsOfX)
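For what it's worth, the same assignment step can be written without the two Python loops. This is only a sketch under assumptions: the distance helper isn't shown in the question, so Euclidean distance is assumed here, and assign_vectorized, demo_points and demo_centroids are made-up names. It also checks up front that centroids is 2-D, which would have caught the float and tuple arguments with a clearer message.

```python
import numpy as np

def assign_vectorized(centroids, points):
    """Return, for each row of points, the index of its nearest centroid.

    Assumes Euclidean distance (the original distance() is not shown).
    """
    centroids = np.asarray(centroids, dtype=float)
    points = np.asarray(points, dtype=float)
    if centroids.ndim != 2:
        raise ValueError("centroids must be a 2-D array of shape (k, n_features)")
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
    diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    dists = np.linalg.norm(diffs, axis=2)      # shape (n, k)
    return dists.argmin(axis=1)                # shape (n,)

# Made-up data: three 2-D points and two centroids.
demo_points = np.array([[0.0, 0.1], [5.0, 5.2], [0.2, -0.1]])
demo_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
demo_labels = assign_vectorized(demo_centroids, demo_points)
print(demo_labels)
```

To mirror the original function you would pass dataset[:, 1:-1] as points, since the loop version drops the first and last column of each row.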

Calculate Batch Pairwise Sinkhorn Distance in PyTorch

I have two tensors of the same shape, and I want to calculate the pairwise Sinkhorn distance between them using GeomLoss.
What I have tried:
import torch
import geomloss # pip install git+https://github.com/jeanfeydy/geomloss
a = torch.rand((8,4))
b = torch.rand((8,4))
geomloss.SamplesLoss('sinkhorn')(a,b)
# ^ input shape [batch, feature_dim]
# will return a scalar value
geomloss.SamplesLoss('sinkhorn')(a.unsqueeze(1),b.unsqueeze(1))
# ^ input shape [batch, n_points, feature_dim]
# will return a tensor of size [batch] of distances between a[i] and b[i] for each i
However I would like to compute pairwise distance where the resultant tensor should be of size [batch, batch]. To achieve this, I tried the following to use broadcasting:
geomloss.SamplesLoss('sinkhorn')(a.unsqueeze(0), b.unsqueeze(1))
But I got this error message:
ValueError: Samples x and y should have the same batchsize.
The documentation doesn't give examples of how to use the distance's forward function, so here's one way to do it. It requires calling the distance function batch times.
We will construct the distance matrix row by row. Row i holds the distances a[i]<->b[0], a[i]<->b[1], through to a[i]<->b[batch-1]. To do so we need to construct, for each row i, an (8x4) tensor that repeats a[i] once per batch element.
This will do:
a_i = torch.stack(8*[a[i]], dim=0)
Then we calculate the distance between a[i] and each batch in b:
dist(a_i.unsqueeze(1), b.unsqueeze(1))
Having computed one such result for each of the batch values of i, we can stack them into the final [batch, batch] tensor.
Here's the complete code:
batch = a.shape[0]
dist = geomloss.SamplesLoss('sinkhorn')
distances = [dist(torch.stack(batch*[a[i]]).unsqueeze(1), b.unsqueeze(1)) for i in range(batch)]
D = torch.stack(distances)
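If calling the loss batch times is too slow, one possible variation is to lay out all batch*batch pairs along the batch dimension and make a single call. The sketch below is not taken from the GeomLoss documentation: pairwise_from_batched and euclidean_stand_in are hypothetical names, and the stand-in distance is there only so the snippet runs without GeomLoss. It assumes, as in the question, a callable that accepts [batch, n_points, feature_dim] inputs and returns a [batch] tensor; under that assumption geomloss.SamplesLoss('sinkhorn') could be passed in its place.

```python
import torch

def pairwise_from_batched(batched_dist, a, b):
    """Build an [n, m] matrix from a distance that only compares a[i] with b[i]."""
    n, m = a.shape[0], b.shape[0]
    # Row-major pairing: (a[0], b[0]), (a[0], b[1]), ..., (a[n-1], b[m-1])
    a_rep = a.repeat_interleave(m, dim=0).unsqueeze(1)   # [n*m, 1, feature_dim]
    b_rep = b.repeat(n, 1).unsqueeze(1)                  # [n*m, 1, feature_dim]
    return batched_dist(a_rep, b_rep).reshape(n, m)

def euclidean_stand_in(x, y):
    # Hypothetical stand-in for the real loss: [batch, n_points, dim] -> [batch]
    return (x - y).norm(dim=-1).sum(dim=-1)

demo_a = torch.rand(8, 4)
demo_b = torch.rand(8, 4)
demo_D = pairwise_from_batched(euclidean_stand_in, demo_a, demo_b)
print(demo_D.shape)
```

The trade-off is memory: the repeated tensors hold batch*batch rows, so for large batches the row-by-row loop above may still be the safer choice.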

problem saving pre-trained fasttext vectors in "word2vec" format with _save_word2vec_format()

For a list of words I want to get their fasttext vectors and save them to a file in the same "word2vec" .txt format (word+space+vector in txt format).
This is what I did:
dict = open("word_list.txt","r") #the list of words I have
path = "cc.en.300.bin"
model = load_facebook_model(path)
vectors = []
words =[]
for word in dict:
    vectors.append(model[word])
    words.append(word)
vectors_array = np.array(vectors)
I want to take the list "words" and the np.ndarray "vectors_array" and save them in the original .txt format.
I tried to use the gensim function "_save_word2vec_format":
def _save_word2vec_format(fname, vocab, vectors, fvocab=None, binary=False, total_vec=None):
    """Store the input-hidden weight matrix in the same format used by the original
    C word2vec-tool, for compatibility.
    Parameters
    ----------
    fname : str
        The file path used to save the vectors in.
    vocab : dict
        The vocabulary of words.
    vectors : numpy.array
        The vectors to be stored.
    fvocab : str, optional
        File path used to save the vocabulary.
    binary : bool, optional
        If True, the data wil be saved in binary word2vec format, else it will be saved in plain text.
    total_vec : int, optional
        Explicitly specify total number of vectors
        (in case word vectors are appended with document vectors afterwards).
    """
    if not (vocab or vectors):
        raise RuntimeError("no input")
    if total_vec is None:
        total_vec = len(vocab)
    vector_size = vectors.shape[1]
    if fvocab is not None:
        logger.info("storing vocabulary in %s", fvocab)
        with utils.open(fvocab, 'wb') as vout:
            for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
                vout.write(utils.to_utf8("%s %s\n" % (word, vocab_.count)))
    logger.info("storing %sx%s projection weights into %s", total_vec, vector_size, fname)
    assert (len(vocab), vector_size) == vectors.shape
    with utils.open(fname, 'wb') as fout:
        fout.write(utils.to_utf8("%s %s\n" % (total_vec, vector_size)))
        # store in sorted order: most frequent words at the top
        for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
            row = vectors[vocab_.index]
            if binary:
                row = row.astype(REAL)
                fout.write(utils.to_utf8(word) + b" " + row.tostring())
            else:
                fout.write(utils.to_utf8("%s %s\n" % (word, ' '.join(repr(val) for val in row))))
but I get the error:
INFO:gensim.models._fasttext_bin:loading 2000000 words for fastText model from cc.en.300.bin
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:Updating model with new vocabulary
INFO:gensim.models.word2vec:New added 2000000 unique words (50% of original 4000000) and increased the count of 2000000 pre-existing words (50% of original 4000000)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 2000000 items
INFO:gensim.models.word2vec:sample=1e-05 downsamples 6996 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 390315457935 word corpus (70.7% of prior 552001338161)
INFO:gensim.models.fasttext:loaded (4000000, 300) weight matrix for fastText model from cc.en.300.bin
trials.py:42: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).
vectors.append(model[word])
INFO:__main__:storing 8664x300 projection weights into arrays_to_txt_oct3.txt
loading the model for: en
finish loading the model for: en
len(vectors): 8664
len(words): 8664
shape of vectors_array (8664, 300)
mission launched!
Traceback (most recent call last):
File "trials.py", line 102, in <module>
_save_word2vec_format(YOUR_VEC_FILE_PATH, words, vectors_array, fvocab=None, binary=False, total_vec=None)
File "trials.py", line 89, in _save_word2vec_format
for word, vocab_ in sorted(iteritems(vocab), key=lambda item: -item[1].count):
File "/cs/snapless/oabend/tailin/transdiv/lib/python3.7/site-packages/six.py", line 589, in iteritems
return iter(d.items(**kw))
AttributeError: 'list' object has no attribute 'items'
I understand that it has to do with the second argument of the function, but I don't understand how I should turn a list of words into a dictionary object.
I tried doing that with:
#convert list of words into a dictionary
words_dict = {i:x for i,x in enumerate(words)}
But still got the error message:
Traceback (most recent call last):
File "trials.py", line 99, in <module>
_save_word2vec_format(YOUR_VEC_FILE_PATH, dict, vectors_array, fvocab=None, binary=False, total_vec=None)
File "trials.py", line 77, in _save_word2vec_format
total_vec = len(vocab)
TypeError: object of type '_io.TextIOWrapper' has no len()
I don't understand how to insert the word list in the right format...
You can directly import & re-use the Gensim KeyedVectors class to assemble your own (sub)set of word-vectors as one instance of KeyedVectors, then use its .save_word2vec_format() method.
For example, roughly this should work:
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_model
with open("word_list.txt", "r") as words_file: # your word-list as a text file
    words_list = [line.strip() for line in words_file if line.strip()] # one word per line, trailing newlines removed
fasttext_path = "cc.en.300.bin"
model = load_facebook_model(fasttext_path)
kv = KeyedVectors(vector_size=model.wv.vector_size) # new empty KV object
vectors = []
for word in words_list:
    vectors.append(model.wv[word]) # vectors for words_list, in same order
kv.add(words_list, vectors) # adds those keys (words) & vectors as batch; in Gensim 4.x this method is named add_vectors
kv.save_word2vec_format('my_kv.vec', binary=False)
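If you would rather not depend on Gensim internals at all, the plain-text layout is simple enough to write by hand: the function quoted in the question shows a header line with the vector count and dimensionality, followed by one "word v1 v2 ..." line per word. The sketch below does just that. save_word2vec_text and the demo data are made-up names and values, not part of Gensim, and the words are stripped because lines read from a file carry a trailing newline.

```python
import os
import tempfile

def save_word2vec_text(path, words, vectors):
    """Write words and their vectors in the plain-text word2vec layout."""
    if len(words) != len(vectors):
        raise ValueError("words and vectors must have the same length")
    vector_size = len(vectors[0]) if len(vectors) else 0
    with open(path, "w", encoding="utf-8") as fout:
        # Header: number of vectors, then dimensionality.
        fout.write(f"{len(words)} {vector_size}\n")
        for word, row in zip(words, vectors):
            values = " ".join(repr(float(v)) for v in row)
            fout.write(f"{word.strip()} {values}\n")

# Made-up demo data; note the trailing newlines, as when iterating over a file.
demo_words = ["cat\n", "dog\n"]
demo_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.vec")
save_word2vec_text(demo_path, demo_words, demo_vectors)

with open(demo_path, encoding="utf-8") as fin:
    demo_lines = fin.read().splitlines()
print(demo_lines)
```

The vectors argument can be a list of lists or a 2-D NumPy array such as vectors_array, since only len() and row iteration are used.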

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column like below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But I got the following error:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
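Here is a small end-to-end sketch of that, using a made-up four-row frame (demo_df and the total_amount_scaled column name are illustrative only). It selects the column with double brackets, scales it, and writes the result back as a new column.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up data standing in for the real DataFrame.
demo_df = pd.DataFrame({"total_amount": [0.0, 1.0, 2.0, 4.0]})

scaler = MinMaxScaler()
# Double brackets give a (4, 1) DataFrame, which is the 2-D shape the scaler expects.
scaled = scaler.fit_transform(demo_df[["total_amount"]])

# fit_transform returns a 2-D NumPy array; flatten it to store it as one column.
demo_df["total_amount_scaled"] = scaled.ravel()
print(demo_df)
```

Keeping the fitted scaler object around lets you apply the same scaling to new data later with scaler.transform.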
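Another option is to scale every column of the DataFrame at once (this assumes all columns are numeric):
import pandas as pd
from sklearn import preprocessing
x = df.values # returns a NumPy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index) # keep the original column names and index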

how to solve this error with lambda and sorted method when i try to make sentiment analysis (POS or NEG text)?

Input code:
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
Result:
Traceback (most recent call last):
File "C:\Users\Sarah\Desktop\python\test.py", line 78, in <module>
best = sorted(word_scores.items(), key=lambda w, s: s, reverse=True)[:10000]
TypeError: <lambda>() missing 1 required positional argument: 's'
How do I solve it?
If I've understood the format of your word_scores dictionary correctly (that the keys are words and the values are integers representing scores), and you're simply looking to get an ordered list of words with the highest scores, it's as simple as this:
best = sorted(word_scores, key=word_scores.get, reverse=True)[:10000]
If you want to use a lambda to get an ordered list of tuples, where each tuple is a word and a score, and they are ordered by score, you can do the following:
best = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:10000]
The difference between this and your original attempt is that I have passed one argument (x) to the lambda, and x is a tuple of length 2 - x[0] is the word and x[1] is the score. Since we want to sort by score, we use x[1].
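To see both variants side by side, here is a tiny runnable sketch with a made-up dictionary (demo_scores is illustrative, not your data). It also shows operator.itemgetter(1), which does the same job as the lambda.

```python
from operator import itemgetter

# Made-up word scores; in the question these come from word_scores.
demo_scores = {"good": 9.5, "bad": 8.0, "the": 0.1, "great": 7.2}

# Words only, highest score first.
top_words = sorted(demo_scores, key=demo_scores.get, reverse=True)[:3]

# (word, score) tuples, sorted on the second element of each tuple.
top_pairs = sorted(demo_scores.items(), key=lambda pair: pair[1], reverse=True)[:3]

# Same result using itemgetter instead of a lambda.
top_pairs_ig = sorted(demo_scores.items(), key=itemgetter(1), reverse=True)[:3]

print(top_words)
print(top_pairs)
```

The original error comes from sorted calling the key function with a single argument (one (word, score) tuple), while lambda w, s: s expects two.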
