I have two general functions, Estability3 and Lstability3, which I would like to apply both to two-dimensional slices of arrays and to one-dimensional ranges of vectors. I have reproduced the error outside the functions in a Jupyter notebook, using some of the arguments to the functions.
These functions compute energy and angular momentum. The position and velocity data needed for this are stored in a two-dimensional matrix called xvec, where each row holds the position and velocity of one star and the three rows represent the three stars. xvec0 is the initial data for the simulation (timestep 0).
xvec0
array([[-5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -2.23606798e+00, 0.00000000e+00],
[ 5.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 2.23606798e+00, 0.00000000e+00],
[ 9.95024876e+02, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00, 4.46099737e-01, 0.00000000e+00]])
I select the first star of the zeroth timestep by selecting the first row of this matrix. If I were looping over thousands of timesteps as usual, I would append thousands of matrices like this to a list and then convert the list to a NumPy array, so xvec1_0 would have thousands of rows instead of one.
xvec1_0=xvec0[0]
Since xvec1_0 has only one row, here I am trying to force NumPy to recognize it as a matrix. It doesn't work.
np.reshape(xvec1_0,(1,6))
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
I see that the output has two outer brackets, which implies that it is a matrix. But when I try to use the colon index over the one row, like I normally do over the thousands of rows, I get an error.
xvec1_0[:,0:3]
IndexError Traceback (most recent call last)
<ipython-input-115-79d26475ac10> in <module>
----> 1 xvec1_0[:,0:3]
IndexError: too many indices for array
Why can't I use the : operator to obtain the first row of this two dimensional array? How can I do that in this more general code that also applies to matrices?
Thanks,
Steven
I think I misread the documentation for reshape. I thought it changed the array in place. It doesn't; I needed to assign the output, like this:
xvec1_0 = np.reshape(xvec1_0,(1,6))
xvec1_0[:,0:3]
array([[-5., 0., 0.]])
xvec1_0
array([[-5. , 0. , 0. , -0. , -2.23606798,
0. ]])
xvec1_0.shape
(1, 6)
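To make the point above concrete, here is a minimal sketch using the xvec0 values from the question. The names row_1d, row_2d, row_slice and row_auto are mine, not from the original code. It shows that np.reshape returns a new array, and lists a few ways to keep a single row two-dimensional so that the same [:, 0:3] indexing works for one timestep or many.

```python
import numpy as np

xvec0 = np.array([[-5.0, 0.0, 0.0, -0.0, -2.23606798, 0.0],
                  [5.0, 0.0, 0.0, -0.0, 2.23606798, 0.0],
                  [995.024876, 0.0, 0.0, -0.0, 0.446099737, 0.0]])

row_1d = xvec0[0]                  # integer index drops a dimension: shape (6,)
np.reshape(row_1d, (1, 6))         # result is discarded, row_1d is unchanged

row_2d = np.reshape(row_1d, (1, 6))   # keep the returned array: shape (1, 6)
row_slice = xvec0[0:1]                # slicing keeps the row axis: shape (1, 6)
row_auto = np.atleast_2d(row_1d)      # promotes 1-D input to shape (1, 6)

positions = row_2d[:, 0:3]            # works, shape (1, 3)

# An ellipsis indexes the last axis regardless of how many axes come before it,
# so the same expression accepts both 1-D and 2-D input.
positions_1d = row_1d[..., 0:3]       # shape (3,)
positions_all = xvec0[..., 0:3]       # shape (3, 3)
```

If the functions should accept both vectors and matrices, calling np.atleast_2d on the argument at the top of the function, or indexing with an ellipsis, is probably the least intrusive change.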
Thanks to a friend's help, I discovered that the following works just fine.
import numpy as np
x = np.zeros((1,6))
print(x.shape)
print(x[:,0:3])
x[:,0:3]
(1, 6)
[[0. 0. 0.]]
array([[0., 0., 0.]])
x = np.zeros((6,))
print(x.shape)
x = np.reshape(x, (1,6))
print(x[:,0:3])
x[:,0:3]
(6,)
[[0. 0. 0.]]
array([[0., 0., 0.]])
I probably should have thought of some of these tests myself, but I thought I had already found the most basic test when I saw the output from np.reshape. I really appreciate the help from my friend, and hope my question did not waste anyone's time too badly.
For whatever reason this only returns 0 or 1 instead of floats between them.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01,],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler scales each column (feature) independently.
Your list of lists is read as 2 samples with 6 features, which I guess is not what you mean, so you also need to reshape it, for example into a NumPy array with a single column.
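import numpy
from sklearn import preprocessing
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
[1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1,1)
X_scale = preprocessing.MinMaxScaler().fit_transform(X)
print(X_scale)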
Results after MinMax Scaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
I'm not exactly sure whether you want to min-max scale each list separately or all together.
The answer you got from MinMaxScaler is the expected one.
When you have only two data points, you will get only 0s and 1s. See the example here for a three-data-point scenario.
You need to understand that it maps the lowest value to 0 and the highest value to 1 for each column. When you have more data points, the remaining ones are calculated from the range (max - min), that is (x - min) / (max - min). See the formula here.
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That's the reason why you did not get any error.
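The column-wise behaviour described above can be checked with a few lines of plain NumPy. This is only a sketch of the formula, not scikit-learn's actual code. The helper min_max_columns and the extra third row are my own, and mapping constant columns to 0 is chosen to match the output shown in the question.

```python
import numpy as np

def min_max_columns(X):
    """Scale every column of X to [0, 1] with (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # constant column: avoid 0/0, result is 0
    return (X - col_min) / col_range

X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
     [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]

# Two samples: each column contains only its own min and max, so only 0s and 1s.
two_rows_scaled = min_max_columns(X)

# A hypothetical third sample lands strictly between 0 and 1 where it differs.
three_rows_scaled = min_max_columns(X + [[1.6, 0.8, 2.0, 20.0, 8.0, 63.01]])

# Scaling each list on its own instead: transpose, scale, transpose back.
rows_scaled = min_max_columns(np.asarray(X).T).T

print(two_rows_scaled)
print(three_rows_scaled)
print(rows_scaled)
```

With only two rows, every entry is either its column's minimum or its maximum, which is why the original output has no values in between.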
I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above such that the first sub-list in list_1 is compared against every sub-list of list_2, then the second sub-list in list_1 against every sub-list of list_2, and so on.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similarity() method to generate the required matrix?
Would it be more efficient and faster using the current method or matrix multiplication?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two strings different dimensions to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
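As a compact variant of the steps above, the divisor matrix can also be built directly for the two subsets with an outer product of their row norms, which avoids computing the full n x n matrix and then slicing it. The small count matrices below are made up for illustration; they are not the CountVectorizer output from the example.

```python
import numpy as np

# Hypothetical word-count matrices (rows: documents, columns: vocabulary words).
counts_1 = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 1]])
counts_2 = np.array([[1, 0, 0, 0],
                     [0, 1, 1, 0],
                     [0, 0, 0, 1]])

# Numerator: dot product of every row of counts_1 with every row of counts_2.
dot_products = counts_1 @ counts_2.T          # shape (2, 3)

# Denominator: product of the corresponding row magnitudes.
norms_1 = np.linalg.norm(counts_1, axis=1)    # shape (2,)
norms_2 = np.linalg.norm(counts_2, axis=1)    # shape (3,)
divisors = np.outer(norms_1, norms_2)         # shape (2, 3)

cos_sim = dot_products / divisors
print(cos_sim)
```

A row with no counts at all has magnitude 0 and would cause a division by zero, so such rows would need to be handled separately.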
There are two problems in the code, in the second-to-last and last lines:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a direct function to calculate the similarity between two sentences (L1 and L2): each is first converted to a vector, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't do any kind of matrix multiplication with it.
If you want to do your own matrix multiplication, then instead of using n_similarity() directly, calculate the vectors of the sentences and then apply matrix multiplication while calculating the cosine similarity.
2. As I said in (1), everything is done in n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
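The "calculate the vectors yourself" route from (1) could look roughly like the sketch below. It assumes, as far as I know, that n_similarity() takes the cosine between the mean word vectors of the two lists. To keep it runnable without the GoogleNews file, it uses a made-up dictionary toy_vectors with random vectors. With gensim you would look up model[word] instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the word2vec model: random 8-dimensional vectors.
vocab = ['flavor', 'taste', 'scent', 'aroma', 'love', 'noodles', 'perfumes', 'buy']
toy_vectors = {word: rng.normal(size=8) for word in vocab}

list_1 = [['flavor', 'taste'], ['scent', 'aroma']]
list_2 = [['love', 'noodles'], ['buy', 'perfumes'], ['taste', 'aroma']]

def unit_mean_vectors(sentences):
    """Average the word vectors of each sentence, then scale to unit length."""
    means = np.array([np.mean([toy_vectors[w] for w in s], axis=0)
                      for s in sentences])
    return means / np.linalg.norm(means, axis=1, keepdims=True)

A = unit_mean_vectors(list_2)   # shape (len(list_2), 8)
B = unit_mean_vectors(list_1)   # shape (len(list_1), 8)

# For unit vectors the dot product is the cosine, so one matrix product
# fills the whole len(list_2) x len(list_1) similarity matrix.
similarity_mat = A @ B.T
print(similarity_mat)
```

For a handful of short lists the speed difference to the double loop is unlikely to matter. The single matrix product should pay off mainly when both lists are large.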
I am trying to use NearestNeighbors from scikit-learn. I wrote an example to understand what happens in this function.
from sklearn.neighbors import NearestNeighbors
samples = [[0.2, 0], [0.5, 0.1], [0.4,0.4]]
neigh = NearestNeighbors(n_neighbors=2,metric='mahalanobis')
neigh.fit(samples)
print(neigh.kneighbors([[272,7522752]])) # use any point to test
The above code works well and correctly computes the 2 nearest points.
But when I use my own dataset, an error occurs. The dataset is a 9959 x 384 matrix. I print the matrix below; it is stored in the variable training_data.
[[ 0.069915 0.020142 0.070054 ..., 0.333937 0.477351 0.055993]
[ 0.131826 0.038203 0.131573 ..., 0.353589 0.426197 0.048557]
[ 0.130338 0.02595 0.130351 ..., 0.315951 0.32355 0.098884]
...,
[ 0.053331 0.023395 0.0534 ..., 0.366064 0.404756 0.066217]
[ 0.063554 0.021197 0.063671 ..., 0.235945 0.439595 0.105366]
[ 0.123632 0.045492 0.12322 ..., 0.308702 0.437344 0.040144]]
When I use training_data in the code above, changing only samples to training_data, I get an error:
LinAlgError: 0-dimensional array given. Array must be at least two- dimensional
Please help me solve this problem, thanks a lot!
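One thing that may be worth trying (a guess, since the full traceback is not shown): the Mahalanobis metric needs a covariance matrix, and an error about a 0-dimensional array is what I would expect if none is supplied. The sketch below passes the inverse covariance explicitly through metric_params. The random array only stands in for the real 9959 x 384 data, and algorithm='brute' is my choice for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
training_data = rng.random((200, 5))   # stand-in for the real 9959 x 384 matrix

# Inverse covariance of the features (rowvar=False: columns are the features).
VI = np.linalg.inv(np.cov(training_data, rowvar=False))

neigh = NearestNeighbors(n_neighbors=2, algorithm='brute',
                         metric='mahalanobis', metric_params={'VI': VI})
neigh.fit(training_data)

# Query with the first training row: its nearest neighbour is itself.
distances, indices = neigh.kneighbors(training_data[:1])
print(distances, indices)
```

With 384 features the covariance matrix may be close to singular. In that case np.linalg.pinv instead of np.linalg.inv might be the safer choice.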
The scikit-learn function pairwise_distances provides the distance matrix from an array X.
However, for some inputs the results seem not to be precise.
Example:
from sklearn.metrics.pairwise import pairwise_distances
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.55215782]]
print(pairwise_distances(X))
Gives the following output:
[[ 0. 0.]
[ 0. 0.]]
Although there is a distance of 0.00000002.
2nd Example:
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
[[ 0.00000000e+00 2.10734243e-08]
[ 2.10734243e-08 0.00000000e+00]]
Here a distance is reported, but it is only correct to the first few digits.
For my application it is undesirable if the output can be zero although there is a distance.
Is there a good way to increase the precision?
I didn't dig into why scikit-learn gives such an imprecise result, but it seems SciPy gives better precision. Try this:
from scipy.spatial.distance import pdist, squareform
squareform(pdist(X))
For example,
X = [[-0.903858372568, -0.5521578], [-0.903858372568, -0.552157821]]
gives
array([[ 0.00000000e+00, 2.10000000e-08],
[ 2.10000000e-08, 0.00000000e+00]])
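One likely explanation, as far as I understand scikit-learn's Euclidean implementation: it uses the expanded form ||x||^2 + ||y||^2 - 2 x.y, which is fast for large matrices but loses precision when two points are almost identical, because two nearly equal numbers of size about 1 are subtracted. The sketch below compares that expansion with the direct difference in plain NumPy. It illustrates the effect and is not the library's actual code.

```python
import numpy as np

X = np.array([[-0.903858372568, -0.5521578],
              [-0.903858372568, -0.552157821]])
x, y = X

# Direct form: subtract first, then square. The tiny difference survives.
direct = np.sqrt(np.sum((x - y) ** 2))

# Expanded form: the true squared distance (about 4.4e-16) is of the same
# order as the rounding error of the terms, which are each of size about 1.
expanded_sq = x @ x + y @ y - 2 * (x @ y)
expanded = np.sqrt(max(expanded_sq, 0.0))   # clip small negative rounding results

print(direct)     # close to 2.1e-08
print(expanded)   # dominated by rounding error
```

If exact small distances matter, computing the differences directly, as scipy's pdist appears to do, is the more robust option at the cost of some speed.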