Hierarchical Clustering with cosine similarity metric in fcluster package

I use scipy.cluster.hierarchy to run hierarchical clustering on a set of points with the "cosine" metric. For example:
import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt
Points = np.array([[ 0. , 0.23508573],
[ 0.00754775 , 0.26717266],
[ 0.00595464 , 0.27775905],
[ 0.01220563 , 0.23622067],
[ 0.00542628 , 0.14185873],
[ 0.03078922 , 0.11273108],
[ 0.06707743 ,-0.1061131 ],
[ 0.04411757 ,-0.10775407],
[ 0.01349434 , 0.00112159],
[ 0.04066034 , 0.11639591],
[ 0. , 0.29046682],
[ 0.07338036 , 0.00609912],
[ 0.01864988 , 0.0316196 ],
[ 0. , 0.07270636],
[ 0. , 0. ]])
z = hac.linkage(Points, metric='cosine', method='complete')
labels = hac.fcluster(z, 0.1, criterion="distance")
plt.scatter(Points[:, 0], Points[:, 1], c=labels.astype(float))  # np.float was removed in NumPy 1.24; the builtin float works
plt.show()
Since I use the cosine metric, the dot product of two vectors can be negative and the norm of some vectors can be zero. As a result, the z output contains negative or infinite elements, which are not valid input for fcluster (see below):
z =
[[ 0.00000000e+00 1.00000000e+01 0.00000000e+00 2.00000000e+00]
[ 1.30000000e+01 1.50000000e+01 0.00000000e+00 3.00000000e+00]
[ 8.00000000e+00 1.10000000e+01 4.26658708e-13 2.00000000e+00]
[ 1.00000000e+00 2.00000000e+00 2.31748880e-05 2.00000000e+00]
[ 3.00000000e+00 4.00000000e+00 8.96700489e-05 2.00000000e+00]
[ 1.60000000e+01 1.80000000e+01 3.98805492e-04 5.00000000e+00]
[ 1.90000000e+01 2.00000000e+01 1.33225099e-03 7.00000000e+00]
[ 5.00000000e+00 9.00000000e+00 2.41120340e-03 2.00000000e+00]
[ 6.00000000e+00 7.00000000e+00 1.52914684e-02 2.00000000e+00]
[ 1.20000000e+01 2.20000000e+01 3.52441432e-02 3.00000000e+00]
[ 2.10000000e+01 2.40000000e+01 1.38662986e-01 1.00000000e+01]
[ 1.70000000e+01 2.30000000e+01 6.99056531e-01 4.00000000e+00]
[ 2.50000000e+01 2.60000000e+01 1.92543748e+00 1.40000000e+01]
[ -1.00000000e+00 2.70000000e+01 inf 1.50000000e+01]]
To solve this problem, I looked at the linkage() function, which in turn calls _hierarchy.linkage(). I use the PyCharm editor, and when I asked for the "linkage" source code, it opened a Python file named "_hierarchy.py" in a directory like the following:
.PyCharm40/system/python_stubs/-1247972723/scipy/cluster/_hierarchy.py
This Python file does not contain actual definitions for the functions it lists.
Where is the real source of this function so that I can revise it, or is there another way to solve this problem?
I would appreciate any help or hints.

You have a zero vector (0, 0) in your data set. Cosine distance is undefined for such a vector, so you are using an inappropriate distance function for this data.
This is a definition gap that cannot be trivially closed. inf is as incorrect as 0. The cosine distance to (0, 0) cannot be defined without contradictions. You must not use cosine on such data.
Back to your actual question: _hierarchy is a Cython module. It is not pure Python; it is compiled to native code. You can see the source code on GitHub:
https://github.com/scipy/scipy/blob/master/scipy/cluster/_hierarchy.pyx
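As a minimal sketch of one way to avoid the undefined distances, assuming it is acceptable to leave zero-norm rows out of the clustering (that is a modeling decision, not something the answer above prescribes), you could filter them out before calling linkage. The shortened `points` array and the variable names below are illustrative only:

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# A shortened, illustrative version of the data, including the zero vector.
points = np.array([
    [0.0,    0.235],
    [0.0075, 0.267],
    [0.067, -0.106],
    [0.044, -0.108],
    [0.0,    0.0],   # cosine distance to this row is undefined
])

# Keep only rows with a non-zero Euclidean norm.
norms = np.linalg.norm(points, axis=1)
nonzero = norms > 0
filtered = points[nonzero]

# Cluster the remaining rows; all merge distances are now finite.
z = hac.linkage(filtered, metric="cosine", method="complete")
labels = hac.fcluster(z, 0.1, criterion="distance")

print(np.isfinite(z).all())
print(labels)
```

The rows that were dropped can then be handled separately, for example by assigning them to their own group.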

Related

Changing the values of matrix is changing the weights of the model

I am working with neural network weights and I am seeing a weird thing. I have written this code:
x = list(mnist_classifier.named_parameters())
weight = x[0][1].detach().cpu().numpy().squeeze()
print(weight)
So I get the following values:
[[[-0.2435195 0.05255396 -0.32765684]
[ 0.06372751 0.03564635 -0.31417745]
[ 0.14694464 -0.03277654 -0.10328879]]
[[-0.13716389 0.0128522 0.24107361]
[ 0.45231998 0.15497956 0.11112727]
[ 0.18206735 -0.22820294 -0.29146808]]
[[ 1.1747813 0.9206593 0.49848938]
[ 1.1558323 1.0859997 0.7743778 ]
[ 1.0287125 0.52122927 0.4096022 ]]
[[-0.2980809 -0.04358199 -0.26461622]
[-0.1165191 -0.2267315 0.37054354]
[ 0.4429275 0.44967037 0.06866694]]
[[ 0.39549246 0.10898255 0.32859102]
[-0.07753246 0.1628792 0.03021396]
[ 0.323148 0.5103844 0.16282919]]
....
Now, when I change the value of the first matrix weight[0] to 0.1, it changes the values of the original weights:
x = list(mnist_classifier.named_parameters())
weight = x[0][1].detach().cpu().numpy().squeeze()
weight[0] = weight[0] * 0 + 0.1
print(list(mnist_classifier.named_parameters()))
[('conv1.weight', Parameter containing:
tensor([[[[ 0.1000, 0.1000, 0.1000],
[ 0.1000, 0.1000, 0.1000],
[ 0.1000, 0.1000, 0.1000]]],
[[[-0.1372, 0.0129, 0.2411],
[ 0.4523, 0.1550, 0.1111],
[ 0.1821, -0.2282, -0.2915]]],
[[[ 1.1748, 0.9207, 0.4985],
[ 1.1558, 1.0860, 0.7744],
[ 1.0287, 0.5212, 0.4096]]],
...
What is going on here? How is weight[0] connected to the neural network?

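I found the answer. For a model on the CPU, the NumPy array returned by .detach().cpu().numpy() shares memory with the original tensor, and squeeze() only returns a view of it, so assigning to weight[0] writes directly into the model's weights. Calling copy() on the array gives an independent copy, and that solved it.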
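The same behaviour can be reproduced with NumPy alone. The sketch below uses a plain array named `weights` as a stand-in for the buffer shared with the model (an assumption for illustration; no PyTorch is involved):

```python
import numpy as np

# Stand-in for the array that shares memory with the model's weights.
weights = np.zeros((2, 1, 3, 3), dtype=np.float32)

# squeeze() returns a view on the same memory, not a copy.
view = weights.squeeze()
view[0] = 0.1                       # this write shows up in `weights` too
print(np.shares_memory(weights, view))

# copy() allocates a separate buffer, so edits stay local.
safe = weights.squeeze().copy()
safe[1] = 5.0                       # `weights` is not affected
print(np.shares_memory(weights, safe))
```

If the goal is only to inspect or experiment with the values, taking the copy right after converting to NumPy keeps the model untouched.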

Difference in output of spacy nlp .vector when applied on sentence?

I am doing the following:
import spacy
nlp = spacy.load("en")
doc = nlp('Hello Stack Over Flow, my name is Steve')
Evaluating doc.vector gives:
In [1]: doc = nlp('Hello Stack Over Flow, my name is Steve')
In [2]: doc.vector
Out[2]:
array([ 1.67874452e-02, 1.43885329e-01, -1.64147541e-01, -3.52525562e-02,
1.71078995e-01, 5.81666678e-02, 1.42294103e-02, -1.58536658e-01,
-1.17119223e-01, 1.00338888e+00, -1.03455082e-01, 5.80027774e-02,
5.08872233e-02, -2.64734793e-02, -4.76809964e-02, -3.61649990e-02,
-4.25985567e-02, 4.86545563e-01, -5.22996634e-02, 2.66118869e-02,
-7.14791119e-02, 2.33504437e-02, -1.01438001e-01, 1.78358995e-03,
6.41188920e-02, -1.93965547e-02, -1.72182247e-02, -4.99197766e-02,
3.82994451e-02, 2.89904438e-02, 1.10834874e-01, 1.07230783e-01,
1.72666041e-03, 9.85269994e-02, -2.64622234e-02, 1.47332232e-02,
1.49853658e-02, -3.25594470e-02, -2.28943750e-02, -6.28201067e-02,
-4.13866527e-03, 4.12439965e-02, -1.09200180e-03, -3.77365127e-02,
3.02788876e-02, -2.47912239e-02, -3.86282206e-02, -8.49756673e-02,
8.79433304e-02, -7.35666696e-03, -2.35625561e-02, 1.29868105e-01,
-8.24742168e-02, 3.79751101e-02, 6.52077794e-03, 4.12433175e-03,
-4.44555469e-03, -8.54532197e-02, 4.30566669e-02, -4.90945578e-02,
1.08687999e-02, -3.58653292e-02, 3.19277793e-02, 1.70548886e-01,
7.04367757e-02, -1.03306666e-01, -6.25603348e-02, -4.16669573e-05,
-9.90156457e-03, 4.87144403e-02, -6.59128875e-02, 2.21944507e-03,
6.23853356e-02, -1.16886329e-02, -2.20711138e-02, 1.35971338e-01,
5.85511066e-02, -2.78507806e-02, -4.42699976e-02, 1.22686662e-01,
-4.96295579e-02, 8.47733300e-03, -1.72136649e-02, 3.73593345e-02,
1.38313353e-01, -1.81285888e-01, 8.07836726e-02, -1.01186670e-01,
1.90296680e-01, -8.37400090e-03, -4.79855575e-02, 4.62987460e-02,
4.97333193e-03, 1.08253332e-02, 1.37178123e-01, -4.36927788e-02,
-9.02644824e-03, 2.52826661e-02, -2.60283332e-02, 7.33327791e-02,
-4.21555527e-02, -9.45088938e-02, -2.36399993e-02, -2.59645544e-02,
-1.17972204e-02, -7.21249953e-02, -1.62978880e-02, 4.46572453e-02,
8.05888604e-03, 1.73073336e-02, -1.11245394e-01, -1.35631096e-02,
4.26412188e-02, -1.24742221e-02, -4.93782237e-02, -3.84650044e-02,
9.32500139e-03, -2.58344412e-02, 5.39288903e-03, -2.51024440e-02,
-1.68177821e-02, 1.81681886e-02, 6.95144460e-02, 5.96744493e-02,
1.28178876e-02, 8.18611085e-02, 2.03688871e-02, -1.45592675e-01,
-2.97091678e-02, 1.67966553e-03, 2.56901123e-02, -1.57507751e-02,
-3.29821557e-02, 3.69144455e-02, 2.69458871e-02, -7.87097737e-02,
-3.22544426e-02, 9.35557822e-04, 2.51506642e-02, -1.39920013e-02,
-5.63631117e-01, 1.28184333e-01, 8.25011209e-02, 4.69026715e-02,
-2.58401129e-02, 3.11454497e-02, 7.81277791e-02, -1.18433349e-02,
2.19431128e-02, 2.38199951e-03, -2.19482221e-02, 5.75609989e-02,
1.32304668e-01, 4.28974479e-02, -1.32128010e-02, 4.54772264e-02,
-9.00077820e-02, -7.34564438e-02, -8.14672261e-02, -5.10835573e-02,
-3.27358916e-02, 2.09213328e-02, 5.85612208e-02, -2.49340013e-02,
-1.03430830e-01, -1.28346771e-01, 4.52880040e-02, 5.96577907e-03,
1.12773672e-01, -3.90797779e-02, -5.79966642e-02, 4.97789842e-05,
2.49000057e-03, -2.88800001e-02, -9.96003374e-02, 3.41123343e-02,
-3.62301096e-02, -7.10571110e-02, -5.67906946e-02, 4.61289100e-03,
7.72120059e-02, -1.36105552e-01, -6.25717789e-02, -8.04037750e-02,
2.12122276e-02, -6.30133413e-03, -9.87700000e-02, 6.31399453e-02,
-8.64481106e-02, -4.26407792e-02, -8.36099982e-02, 1.07030040e-02,
-1.34339988e-01, 6.82333438e-03, 5.62012270e-02, 6.89233318e-02,
5.61566688e-02, -9.32652280e-02, 6.18273281e-02, 1.12723336e-01,
-1.04766667e-01, -2.15716790e-02, -1.15266666e-01, 4.57017794e-02,
7.47987852e-02, -9.02220607e-04, 7.75654465e-02, -2.66306698e-02,
1.93627775e-02, -4.89100069e-03, -1.43213451e-01, -6.52845576e-02,
1.64663326e-02, -5.07618897e-02, -1.49422223e-02, 4.21274304e-02,
1.06691113e-02, -5.97029589e-02, -1.20738111e-01, -1.61822215e-02,
-5.95551059e-02, 3.67141105e-02, 2.88833342e-02, 5.24356700e-02,
7.51844468e-03, -3.79579999e-02, 9.96864438e-02, 1.28289998e-01,
1.56755541e-02, -1.55926663e-02, -4.89732213e-02, 2.24273317e-02,
-9.15533304e-03, 7.32631087e-02, -7.48946667e-02, -1.15108885e-01,
-5.56773357e-02, -8.49866867e-03, -3.00188921e-02, 3.55113335e-02,
-4.22161110e-02, 7.19971135e-02, 3.67489979e-02, -1.00055551e-02,
7.52926618e-02, -1.43726662e-01, -4.08722041e-03, -1.49663329e-01,
1.41400262e-03, 5.52397817e-02, 8.86320025e-02, -7.44862184e-02,
-3.23222089e-03, 3.30205560e-02, 3.77681069e-02, 6.58650026e-02,
2.83081792e-02, -3.24210003e-02, 1.93070006e-02, 5.67157790e-02,
6.17166609e-02, 1.09540010e-02, 4.71896678e-02, 7.68444464e-02,
-2.51592230e-02, -4.28744499e-03, -2.40004435e-02, 3.28795537e-02,
1.25606894e-01, -6.05716556e-02, 5.52507788e-02, -2.12161113e-02,
-8.45399946e-02, -7.95067847e-02, -1.33965556e-02, -5.02544455e-02,
-3.03339995e-02, 1.19719980e-02, 6.15093298e-02, 1.11455554e-02,
1.24445252e-01, 5.54273315e-02, 1.28475904e-01, -9.19478834e-02,
-2.29498874e-02, -4.18815538e-02, 5.02915531e-02, -1.14721097e-02,
1.06602885e-01, -8.45602229e-02, -4.17976640e-02, 1.39088994e-02,
-2.19033333e-03, 7.99388885e-02, 1.08606648e-02, -1.27933361e-02,
-2.84678000e-03, -2.97433343e-02, -8.61347839e-02, 9.06177703e-03],
dtype=float32)
But when I run the following, I get:
In [3]: for token in doc: print("{} : {}".format(token, token.vector[:3]))
Hello : [0. 0. 0.]
Stack : [0. 0. 0.]
Over : [0. 0. 0.]
Flow : [0. 0. 0.]
, : [-0.082752 0.67204 -0.14987 ]
my : [ 0.08649 0.14503 -0.4902 ]
name : [ 0.23231 -0.024102 -0.83964 ]
is : [-0.084961 0.502 0.0023823]
Steve : [0. 0. 0.]
Why do I get different representations?
Is the first vector a representation of the whole sentence?

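The answer is in the spaCy documentation for Doc.vector: "A real-valued meaning representation. Defaults to an average of the token vectors." So doc.vector is the average over all tokens in the sentence, including the tokens whose vector is all zeros.
Source: https://spacy.io/api/doc#vector
I hope this helps others too.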
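This can be checked by hand with the numbers from the question. The sketch below copies the first three components of each token vector printed above into a NumPy array (the array name is illustrative) and averages them:

```python
import numpy as np

# First three components of each token vector, copied from the question's output.
token_vectors = np.array([
    [0.0,       0.0,       0.0],        # Hello
    [0.0,       0.0,       0.0],        # Stack
    [0.0,       0.0,       0.0],        # Over
    [0.0,       0.0,       0.0],        # Flow
    [-0.082752, 0.67204,  -0.14987],    # ,
    [0.08649,   0.14503,  -0.4902],     # my
    [0.23231,  -0.024102, -0.83964],    # name
    [-0.084961, 0.502,     0.0023823],  # is
    [0.0,       0.0,       0.0],        # Steve
])

# Average over all nine tokens, zero vectors included.
doc_vector_head = token_vectors.mean(axis=0)
print(doc_vector_head)
```

The result agrees with the first three entries of doc.vector in the question (about 0.01679, 0.14389, -0.16415), which shows that the zero vectors are part of the average.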

Log_probabilities returned by tf.nn.ctc_beam_search_decoder

I am training an LSTM-CTC speech recognition system using beam search decoding with the following configuration:
decoded, log_prob = tf.nn.ctc_beam_search_decoder(
    inputs,
    sequence_length,
    beam_width=100,
    top_paths=3,
    merge_repeated=True
)
The log_probabilities returned by the above decoder for a batch look like this:
[[ 14.73168373, 14.45586109, 14.35735512],
[ 20.45407486, 20.44991684, 20.41798401],
[ 14.9961853 , 14.925807 , 14.88066769],
...,
[ 18.89863396, 18.85992241, 18.85712433],
[ 3.93567419, 3.92791557, 3.89198923],
[ 14.56258488, 14.55923843, 14.51092243]],
How do these scores represent log probabilities? And if I want to compare the confidence of the top paths across examples, what is the normalisation factor?

How do you compute the distance between text documents for k-means with word2vec?

I have recently been introduced to word2vec and I'm having some trouble trying to figure out how exactly it is used for k-means clustering.
I do understand how k-means works with tf-idf vectors. For each text document you have a vector of tf-idf values, and after choosing some documents as initial cluster centers, you can use the Euclidean distance to minimise the distances between the vectors of the documents. Here's an example.
However, when using word2vec, each word is represented as a vector. Does this mean that each document corresponds to a matrix? And if so, how do you compute the minimum distance w.r.t. other text documents?
Question: How do you compute the distance between text documents for k-means with word2vec?
Edit: To explain my confusion in a bit more detail, please consider the following code:
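from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())
model = Word2Vec(sentences_word2vec, min_count=1)
# One vector per vocabulary word (older gensim API, kept as in the original code)
word2vec_matrix = model[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(len(word2vec_matrix)):
    print(word2vec_matrix[i])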
It prints the following output:
[[ 0. 0.55459491 0. 0. 0.35399075 0. 0.
0. 0. 0. 0. 0. 0. 0.437249
0.35399075 0.35399075 0.35399075 0. ]
[ 0. 0. 0. 0.44302215 0.2827753 0. 0.
0. 0.34928375 0. 0. 0. 0.34928375
0. 0.2827753 0.5655506 0.2827753 0. ]
[ 0. 0. 0.35101741 0. 0. 0.27674616
0.35101741 0. 0. 0.35101741 0. 0.35101741
0.27674616 0.27674616 0.44809973 0. 0. 0.27674616]
[ 0.40531999 0. 0. 0. 0.2587105 0.31955894
0. 0.40531999 0.31955894 0. 0.40531999 0. 0.
0. 0. 0.2587105 0.2587105 0.31955894]]
20
[ 4.08335682e-03 -4.44161100e-03 3.92342824e-03 3.96498619e-03
6.99949533e-06 -2.14108804e-04 1.20419310e-03 -1.29191438e-03
1.64671184e-03 3.41688609e-03 -4.94929403e-03 2.90348311e-03
4.23802016e-03 -3.01274913e-03 -7.36164337e-04 3.47558968e-03
-7.02908786e-04 4.73567843e-03 -1.42914290e-03 3.17237526e-03
9.36070050e-04 -2.23833631e-04 -4.03443904e-04 4.97530040e-04
-4.82502300e-03 2.42140982e-03 -3.61089432e-03 3.37070058e-04
-2.09900597e-03 -1.82093668e-03 -4.74618562e-03 2.41499138e-03
-2.15628324e-03 3.43719614e-03 7.50159554e-04 -2.05973233e-03
1.92534993e-03 1.96503079e-03 -2.02400610e-03 3.99564439e-03
4.95056808e-03 1.47033704e-03 -2.80071306e-03 3.59585625e-04
-2.77896033e-04 -3.21732066e-03 4.36303904e-03 -2.16396619e-03
2.24438333e-03 -4.50925855e-03 -4.70488053e-03 6.30825118e-04
3.81869613e-03 3.75767215e-03 5.01064525e-04 1.70175335e-03
-1.26033701e-04 -7.43318116e-04 -6.74833194e-04 -4.76678275e-03
1.53754558e-03 2.32421421e-03 -3.23472451e-03 -8.32759659e-04
4.67014220e-03 5.15853462e-04 -1.15449808e-03 -1.63017167e-03
-2.73897988e-03 -3.95627553e-03 4.04657237e-03 -1.79282576e-03
-3.26930732e-03 2.85121426e-03 -2.33304151e-03 -2.01760884e-03
-3.33597139e-03 -1.19233003e-03 -2.12347694e-03 4.36858647e-03
2.00414215e-03 -4.23572073e-03 4.98410035e-03 1.79121632e-03
4.81655030e-03 3.33247939e-03 -3.95260006e-03 1.19335402e-03
4.61675343e-04 6.09758368e-04 -4.74696746e-03 4.91552567e-03
1.74517138e-03 2.36604619e-03 -3.06009664e-04 3.62954312e-03
3.56943789e-03 2.92139384e-03 -4.27138479e-03 -3.51175456e-03]
[ -4.14272398e-03 3.45513038e-03 -1.47538856e-04 -2.02292087e-03
-2.96578306e-04 1.88684417e-03 -2.63865804e-03 2.69249966e-03
4.57606697e-03 2.19206396e-03 2.01336667e-03 1.47434452e-03
1.88332598e-03 -1.14452699e-03 -1.35678309e-03 -2.02636060e-04
-3.26160830e-03 -3.95368552e-03 1.40415027e-03 2.30542314e-03
-3.18884710e-03 -4.46776347e-03 3.96415358e-03 -2.07852037e-03
4.98413946e-03 -6.43568579e-04 -2.53325375e-03 1.30117545e-03
1.26555841e-03 -8.84680718e-04 -8.34991166e-04 -4.15050285e-03
4.66807076e-04 1.71844949e-04 1.08140183e-03 4.37910948e-03
-3.28412466e-03 2.09890743e-04 2.29888223e-03 4.70223464e-03
-2.31004297e-03 -5.10134443e-04 2.57104915e-03 -2.55978899e-03
-7.55646848e-04 -1.98197929e-04 1.20443532e-04 4.63618943e-03
1.13036349e-05 8.16594984e-04 -1.65917678e-03 3.29331891e-03
-4.97825304e-03 -2.03667139e-03 3.60272871e-03 7.44500838e-04
-4.40325850e-04 6.38399797e-04 -4.23364760e-03 -4.56386572e-03
4.77551389e-03 4.74880403e-03 7.06148741e-04 -1.24937459e-03
-9.50689311e-04 -3.88551364e-03 -4.45985980e-03 -1.15060725e-03
3.27067473e-03 4.54987818e-03 2.62327422e-03 -2.40981602e-03
4.55576897e-04 3.19155119e-03 -3.84227419e-03 -1.17610034e-03
-1.45622855e-03 -4.32460709e-03 -4.12792247e-03 -1.74557802e-03
4.66075348e-04 3.39668151e-03 -4.00651991e-03 1.41077011e-03
-7.89384532e-04 -6.56061340e-04 1.14822399e-03 4.12205653e-03
3.60721885e-03 -3.11746349e-04 1.44255662e-03 3.11965472e-03
-4.93455213e-03 4.80490318e-03 2.79991422e-03 4.93505970e-03
3.69034940e-03 4.76422161e-03 -1.25827035e-03 -1.94680784e-03]
...
[ -3.92252317e-04 -3.66805331e-03 1.52376946e-03 -3.81564132e-05
-2.57118000e-03 -4.46725264e-03 2.36480637e-03 -4.70252614e-03
-4.18651942e-03 4.54758806e-03 4.38804098e-04 1.28351408e-03
3.40470579e-03 1.00038981e-03 -1.06557179e-03 4.67202952e-03
4.50591929e-03 -2.67829909e-03 2.57702312e-03 -3.65824508e-03
-4.54068230e-03 2.20785337e-03 -1.00554363e-03 5.14690124e-04
4.64830594e-03 1.91410910e-03 -4.83837258e-03 6.73376708e-05
-2.37796479e-03 -4.45193471e-03 -2.60163331e-03 1.51159777e-03
4.06868104e-03 2.55690538e-04 -2.54662265e-03 2.64597777e-03
-2.62586889e-03 -2.71554058e-03 5.49281889e-04 -1.38776843e-03
-2.94354092e-03 -1.13887887e-03 4.59292997e-03 -1.02300232e-03
2.27600057e-03 -4.88117011e-03 1.95790920e-03 4.64376673e-04
2.56658648e-03 8.90390365e-04 -1.40368659e-03 -6.40658545e-04
-3.53228673e-03 -1.30717538e-03 -1.80223631e-03 2.94505036e-03
-4.82233381e-03 -2.16079340e-03 2.58940039e-03 1.60595961e-03
-1.22245611e-03 -6.72614493e-04 4.47060820e-03 -4.95934719e-03
2.70283176e-03 2.93257344e-03 2.13279200e-04 2.59435410e-03
2.98801321e-03 -2.79974379e-03 -1.49789048e-04 -2.53924704e-03
-7.83207070e-04 1.18357304e-03 -1.27669750e-03 -4.16665291e-03
1.40916929e-03 1.63017987e-07 1.36708119e-03 -1.26687710e-05
1.24729215e-03 -2.50442210e-03 -3.20308795e-03 -1.41550787e-03
-1.05747324e-03 -3.97984264e-03 2.25877413e-03 -1.28316227e-03
3.60359484e-03 -1.97929185e-04 3.21712159e-03 -4.96298913e-03
-1.83640339e-03 -9.90608009e-04 -2.03964626e-03 -4.87274351e-03
7.24950165e-04 3.85614252e-03 -4.18979349e-03 2.73840013e-03]
Using tfidf, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(tfidf_matrix)
Using word2vec, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(word2vec_matrix)
(Here's an example of k-means with word2vec). So in the first case, k-means gets a matrix with the tf-idf values of each word per document, while in the second case k-means gets a vector for each word. How can k-means cluster the documents in the second case if it just has the word2vec representations?

Since you are interested in clustering documents, probably the best you can do is to use the Doc2Vec package, which can prepare a vector for each one of your documents. Then you can apply any clustering algorithm to the set of your document vectors for further processing. If, for any reason, you want to use word vectors instead, there are a few things you can do. Here is a very simple method:
For each document, collect all words with the highest TF-IDF values w.r.t. that document.
Average the Word2Vec vectors of those words to create a vector for the whole document.
Apply your clustering to the averaged vectors.
Don't try to average all the words in a document; it won't work.
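The steps above can be sketched as follows. Everything here is a toy assumption for illustration: the two-dimensional `word_vectors`, the precomputed `doc_tfidf` scores, and the helper `document_vector` are hypothetical, and in practice the vectors would come from a trained Word2Vec model and the scores from a TF-IDF vectorizer.

```python
import numpy as np

# Hypothetical toy word vectors (stand-ins for trained Word2Vec vectors).
word_vectors = {
    "cat":    np.array([1.0, 0.0]),
    "dog":    np.array([0.8, 0.2]),
    "stock":  np.array([0.0, 1.0]),
    "market": np.array([0.2, 0.8]),
    "the":    np.array([0.5, 0.5]),
}

# Hypothetical per-document TF-IDF scores (word -> score).
doc_tfidf = [
    {"cat": 0.7, "dog": 0.6, "the": 0.1},
    {"stock": 0.8, "market": 0.5, "the": 0.1},
]

def document_vector(tfidf_scores, vectors, top_k=2):
    """Average the vectors of the top_k words with the highest TF-IDF score."""
    top_words = sorted(tfidf_scores, key=tfidf_scores.get, reverse=True)[:top_k]
    known = [vectors[w] for w in top_words if w in vectors]
    if not known:
        # No usable word: fall back to a zero vector of the right size.
        return np.zeros(len(next(iter(vectors.values()))))
    return np.mean(known, axis=0)

# One row per document; this matrix is what a clustering algorithm would receive.
doc_matrix = np.vstack([document_vector(d, word_vectors) for d in doc_tfidf])
print(doc_matrix)
```

The resulting `doc_matrix` has one row per document, so it can be passed to a clustering algorithm in the same way as the tf-idf matrix in the question.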

genfromtxt return numpy array not separated by comma

I have a *.csv file that stores two columns of float data.
I am using this function to import it, but the resulting array is displayed without commas between the values.
data=np.genfromtxt("data.csv", delimiter=',', dtype=float)
output:
[[ 403.14915 150.560364 ]
[ 403.7822265 135.13165 ]
[ 404.5017 163.4669 ]
[ 434.02465 168.023224 ]
[ 373.7655 177.904114 ]
[ 450.608429 208.4187315]
[ 454.39475 239.9666595]
[ 453.8055 248.4082 ]
[ 457.5625305 247.70315 ]
[ 451.729431 258.19335 ]
[ 366.74405 225.169922 ]
[ 377.0055235 258.110077 ]
[ 380.3581 261.760071 ]
[ 383.98615 262.33805 ]
[ 388.2516785 272.715332 ]
[ 408.378174 200.9713135]]
How can I format it to get a NumPy array like this?
[[ 403.14915, 150.560364 ]
[ 403.7822265, 135.13165 ],....]

NumPy doesn't display commas when you print arrays. If you really want to see them, you can use
print(repr(data))
The repr function returns the literal representation of the array, which is not meant for "nice" printing but is the form you would type yourself to write the data in your code.
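As a small illustration, the sketch below compares the default printout with two comma-separated forms. The two-row `data` array is a shortened stand-in for the file contents, and np.array2string is an additional option not mentioned above:

```python
import numpy as np

# Shortened stand-in for the array loaded by genfromtxt.
data = np.array([[403.14915, 150.560364],
                 [403.7822265, 135.13165]])

plain = str(data)                                   # what print(data) shows: no commas
literal = repr(data)                                # array([[...], [...]]) with commas
custom = np.array2string(data, separator=", ")      # commas, without the array(...) wrapper

print(plain)
print(literal)
print(custom)
```

In all three cases the stored values are identical; only the text representation differs.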
