I am working on a clustering problem. I have 3 cluster centers, shown below, and I want to calculate the Euclidean distance from each of these 3 cluster centers to every row of another m*n matrix. It would be very helpful if anyone could guide me through this.
kmeans.cluster_centers_
Out[99]:
array([[-2.23020213, 0.35654288],
[ 7.69370352, 1.72991757],
[ 0.92519202, -0.29218753]])
matrix
Out[100]:
array([[ 0.11650485, 0.11650485, 0.11650485, 0.11650485, 0.11650485,
0.11650485],
[ 0.11650485, 0.18446602, 0.18446602, 0.2815534 , 0.37864078,
0.37864078],
[ 0.21359223, 0.21359223, 0.21359223, 0.21359223, 0.29708738,
0.35533981],
...,
[ 0.2640625 , 0.2734375 , 0.30546875, 0.31953125, 0.31953125,
0.31953125],
[ 1. , 1. , 1. , 1. , 1. ,
1. ],
[ 0.5 , 0.5 , 0.5 , 0.5 , 0.5 ,
0.5 ]])
I want to do it in Python. I have used sklearn for my clustering.
Euclidean distance is defined on vectors of a fixed length d.
I.e. it is a function R^d x R^d -> R.
So whatever you are trying to do, it is not the usual Euclidean distance. You seem to have k=3 cluster centers with d=2 coordinates, but your matrix has an incompatible shape that cannot be interpreted in an obvious way as 2-dimensional vectors.
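If the rows of your matrix were actually points with the same d=2 features as the centers (shape (n_samples, 2)), the distances could be computed in one call. A minimal sketch, using random stand-in points since the posted matrix has a different shape:

import numpy as np
from scipy.spatial.distance import cdist

centers = np.array([[-2.23020213,  0.35654288],
                    [ 7.69370352,  1.72991757],
                    [ 0.92519202, -0.29218753]])
points = np.random.rand(10, 2)      # stand-in: must have 2 columns, like the centers

dists = cdist(points, centers)      # shape (10, 3): Euclidean distance to each center

For a fitted sklearn KMeans object, kmeans.transform(points) returns the same cluster-distance matrix directly.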
For whatever reason this only returns 0s and 1s instead of floats in between.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler works column-wise on 2D data (a NumPy array or a DataFrame, for example).
You can convert your input to a NumPy array, but then it is treated as 2 samples with 6 features, which I guess is not what you mean, so you also need to reshape.
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1,1)
Results after MinMax Scaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
I'm not exactly sure whether you want to min-max scale each list separately or all values together.
The answer you got from MinMaxScaler is the expected one.
When you have only two data points, you will get only 0s and 1s. See the example below for a three-data-point scenario.
You need to understand that it maps the lowest value in each column to 0 and the highest to 1; with more data points, the values in between are computed from the range, i.e. (x - min) / (max - min).
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That's why you did not get any error.
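To illustrate, here is a small three-sample example (values chosen arbitrarily) showing the per-column formula (x - min) / (max - min) producing values strictly between 0 and 1:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Three samples per column, so the middle value lands between 0 and 1
X3 = np.array([[1.0, 10.0],
               [2.0, 30.0],
               [4.0, 50.0]])
print(MinMaxScaler().fit_transform(X3))
# [[0.         0. ]
#  [0.33333333 0.5]
#  [1.         1. ]]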
I've been playing around with numpy's linalg module and wanted to get the eigenvectors for the following matrix:
import numpy as np
matrix = np.array([[4,0,-1],[0,3,0],[1,0,2]])
w,v = np.linalg.eig(matrix)
print(v)
array([[0.70710678, 0.70710678, 0. ],
[0. , 0. , 1. ],
[0.70710678, 0.70710678, 0. ]])
Calculating the eigenvectors by hand gives me only two vectors which are [1,0,1] and [0,1,0]. I know that numpy normalizes the vectors which is fine but the problem is when I try to check if the first and second columns are equal:
v[:,0] == v[:,1]
array([False, True, False])
This gives me the impression that these are two different vectors (so I now have a total of 3 eigenvectors) when I already know I'll only get two.
Can someone please explain what's going on here.
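A quick way to check whether the two printed columns differ only by floating-point rounding, rather than comparing them bit-for-bit with ==, is a tolerance-based comparison:

print(np.isclose(v[:, 0], v[:, 1]))   # element-wise comparison with a tolerance
print(np.allclose(v[:, 0], v[:, 1]))  # True only if every entry matches within tolerance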
I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders', 'bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above, so that the first sub-list in list_1 is measured against every sub-list of list_2, then the second sub-list in list_1 against every sub-list of list_2, and so on.
The goal is to create a len(list_2) by len(list_1) matrix, where each entry is a cosine similarity score. Currently I've done this in the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication, but with gensim's n_similarity() method, to generate the required matrix?
Which would be more efficient and faster: the current method or matrix multiplication?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two strings different dimensions to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
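As a side note, np.outer produces the same divisor matrix and reads a bit more directly:

divisor_matrix = np.outer(magnitudes, magnitudes)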
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
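As a cross-check (CountVectorizer already comes from sklearn), sklearn's cosine_similarity computes the same matrix in one call from the two count matrices built above:

from sklearn.metrics.pairwise import cosine_similarity

# Should reproduce cos_sim from the step-by-step calculation above
cos_sim_check = cosine_similarity(count_matrix_1, count_matrix_2)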
There are two problems in the code, in the second-to-last and last lines.
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a direct function to calculate the similarity between two sentences (L1 and L2): each is first converted to a vector, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't do any kind of matrix multiplication there.
If you want to do your own matrix multiplication, then instead of using n_similarity() directly, calculate the vectors of the sentences yourself and then apply matrix multiplication while calculating the cosine similarity.
2. As I said in (1), everything is done inside n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
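As a sketch of the do-it-yourself route described in (1), assuming every word is in the model's vocabulary (mean_vectors is just an illustrative helper; n_similarity itself averages the word vectors in much the same way):

import numpy as np

def mean_vectors(lists, model):
    # One row per sub-list: the mean of its word vectors
    return np.vstack([np.mean([model[w] for w in words], axis=0) for words in lists])

V1 = mean_vectors(list_1, model)   # shape (len(list_1), 300)
V2 = mean_vectors(list_2, model)   # shape (len(list_2), 300)

# Normalize the rows so a plain dot product becomes cosine similarity
V1 /= np.linalg.norm(V1, axis=1, keepdims=True)
V2 /= np.linalg.norm(V2, axis=1, keepdims=True)

similarity_mat = V2 @ V1.T         # shape (len(list_2), len(list_1))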
I have the speed of feature points at every frame. There are 165 frames in a video, and every frame contains the speeds of the feature points. This is my data.
TrajDbscanData
array([[ 1. , 0.51935178],
[ 1. , 0.52063496],
[ 1. , 0.54598193],
...,
[165. , 0.47198981],
[165. , 2.2686042 ],
[165. , 0.79044946]])
where the first column is the frame number and the second is the speed of a feature point in that frame.
Here I want to do density-based clustering for different speed ranges. For this, I use the following code.
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster as sklc
core_samples, labels_db = sklc.dbscan(
TrajDbscanData, # array has to be (n_samples, n_features)
eps=0.5,
min_samples=15,
metric='euclidean',
algorithm='auto'
)
core_samples_mask = np.zeros_like(labels_db, dtype=bool)
core_samples_mask[core_samples] = True
unique_labels = set(labels_db)
n_clusters_ = len(unique_labels) - (1 if -1 in labels_db else 0)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
plt.figure(figcount)
figcount += 1
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels_db == k)
    xy = TrajDbscanData[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
    xy = TrajDbscanData[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'x', markerfacecolor=col, markeredgecolor='k', markersize=4)
plt.rcParams["figure.figsize"] = (10, 7)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.grid(True)
plt.show()
I got the following result (plot not shown): the y axis is the speed and the x axis is the frame number.
I want to do density-based clustering according to speed. For example, speeds up to 1.0 in one cluster, speeds from 1.0 to 1.5 as outliers, speeds from 1.5 to 2.0 in another cluster, and speeds above 2.0 in yet another. This helps to identify common motion pattern types. How can I do this?
Don't use Euclidean distance.
Since your x and y axes have very different meanings, that is the wrong distance function to use.
Your plot is misleading, because the axes have different scales. If you scaled x and y the same way, you would see what has been happening... The y axis is effectively ignored, and you slice the data by your discrete integer time axis.
You may need to use Generalized DBSCAN and treat time and value separately!
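For example, one simple way to treat the value separately is to run DBSCAN on just the speed column, so that eps refers to a speed difference rather than a mix of frame numbers and speeds. A minimal sketch, with an illustrative eps value:

import numpy as np
from sklearn.cluster import DBSCAN

speeds = TrajDbscanData[:, 1].reshape(-1, 1)                     # second column only
labels = DBSCAN(eps=0.05, min_samples=15).fit_predict(speeds)    # eps chosen for illustration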
I can't seem to find out how to vectorize this Python 3 loop:
import numpy as np
a = np.array([-72, -10, -70, 37, 68, 9, 1, -3, 2, 3, -6, -4, ], np.int16)
result = np.array([-72, -10, -111, -23, 1, -2, 1, -3, 1, 2, -5, -5, ], np.int16)
b = np.copy(a)
for i in range(2, len(b)):
    b[i] += int((b[i-1] + b[i-2]) / 2)
assert (b == result).all()
I tried playing with np.convolve and pandas.rolling_apply but couldn't get them working. Maybe this is the time to learn about C extensions?
It would be great to get the time for this down to something like 50..100ms for input arrays of ~500k elements.
@hpaulj asked in his answer for a closed expression of b[k] in terms of a[:k]. I didn't think it existed, but I worked on it a bit and indeed found that the closed form contains a bunch of Jacobsthal numbers, as @Divakar pointed out.
Here is one closed form (it was given as an equation image, not reproduced here). J_n is the n-th Jacobsthal number; expanding it as
J_n = (2^n - (-1)^n) / 3
one ends up with an expression for which I can imagine a vectorized implementation ...
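As a quick sanity check of that expansion (using the usual convention J_0 = 0, J_1 = 1), the closed form satisfies the Jacobsthal recurrence J_n = J_{n-1} + 2*J_{n-2}:

# First few Jacobsthal numbers from the closed form: 0, 1, 1, 3, 5, 11, 21, 43
J = [(2**n - (-1)**n) // 3 for n in range(8)]
assert all(J[n] == J[n-1] + 2*J[n-2] for n in range(2, 8))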
Most numpy code operates on the whole array at once. OK, it iterates in C code, but in a buffered way, so it doesn't matter which element is used first.
Here changes to b[2] affect the value calculated for b[3] and on down the line.
add.at and other such ufunc methods do unbuffered calculations. This allows you to add some value repeatedly to one element. I played a bit with it in this case, but no luck so far.
cumsum and cumprod are also handy for problems where values depend on earlier ones.
Is it possible to generalize the calculation, so as to define b[i] in terms of all the preceding a values? We know b[2] as a function of a[:3], but what about b[3]?
Even if we got this working for floats, it might be off because of the integer division.
I think you already have the sane solution. Any other vectorization would rely on floating-point calculations, and it would be really difficult to keep track of the error accumulation. For example, say you want to write it as a matrix-vector multiplication: for the first seven terms the matrix would look like
array([[ 1. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0.5 , 0.5 , 1. , 0. , 0. , 0. , 0. ],
[ 0.25 , 0.75 , 0.5 , 1. , 0. , 0. , 0. ],
[ 0.375 , 0.625 , 0.75 , 0.5 , 1. , 0. , 0. ],
[ 0.3125 , 0.6875 , 0.625 , 0.75 , 0.5 , 1. , 0. ],
[ 0.34375, 0.65625, 0.6875 , 0.625 , 0.75 , 0.5 , 1. ]])
The relationship can be described by the iterative formula

                       [ b[i-2] ]
b[i] = [0.5  0.5  1] * [ b[i-1] ]
                       [ a[i]   ]
That defines a series of elementary matrices, each an identity matrix with
[0 ... 0.5 0.5 1 0 ... 0]
on the i-th row, and successive multiplication gives the matrix above for the first seven terms. There is indeed a subdiagonal structure, but the terms get very small very quickly. As you have shown, 2 to the power 500k is not fun.
In order to keep the floating-point noise under control, an iterative solution is required, which is what you have anyway.
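To make the matrix description concrete, here is a small sketch (floating point only, ignoring the int() truncation in the original loop) that accumulates the elementary matrices and applies the product to a:

import numpy as np

a = np.array([-72, -10, -70, 37, 68, 9, 1, -3, 2, 3, -6, -4], dtype=float)
n = len(a)

# Accumulate E_{n-1} @ ... @ E_2, where E_i is the identity with 0.5, 0.5, 1 on row i
M = np.eye(n)
for i in range(2, n):
    E = np.eye(n)
    E[i, i-2] = 0.5
    E[i, i-1] = 0.5
    M = E @ M

b_float = M @ a   # matches the loop result up to the int() truncation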