Difference in output of spaCy nlp .vector when applied on a sentence? - python-3.x

I am doing the following:
import spacy
nlp = spacy.load("en")
doc = nlp('Hello Stack Over Flow, my name is Steve')
Then I inspect doc.vector:
In [1]: doc = nlp('Hello Stack Over Flow, my name is Steve')
In [2]: doc.vector
Out[2]:
array([ 1.67874452e-02, 1.43885329e-01, -1.64147541e-01, -3.52525562e-02,
1.71078995e-01, 5.81666678e-02, 1.42294103e-02, -1.58536658e-01,
-1.17119223e-01, 1.00338888e+00, -1.03455082e-01, 5.80027774e-02,
5.08872233e-02, -2.64734793e-02, -4.76809964e-02, -3.61649990e-02,
-4.25985567e-02, 4.86545563e-01, -5.22996634e-02, 2.66118869e-02,
-7.14791119e-02, 2.33504437e-02, -1.01438001e-01, 1.78358995e-03,
6.41188920e-02, -1.93965547e-02, -1.72182247e-02, -4.99197766e-02,
3.82994451e-02, 2.89904438e-02, 1.10834874e-01, 1.07230783e-01,
1.72666041e-03, 9.85269994e-02, -2.64622234e-02, 1.47332232e-02,
1.49853658e-02, -3.25594470e-02, -2.28943750e-02, -6.28201067e-02,
-4.13866527e-03, 4.12439965e-02, -1.09200180e-03, -3.77365127e-02,
3.02788876e-02, -2.47912239e-02, -3.86282206e-02, -8.49756673e-02,
8.79433304e-02, -7.35666696e-03, -2.35625561e-02, 1.29868105e-01,
-8.24742168e-02, 3.79751101e-02, 6.52077794e-03, 4.12433175e-03,
-4.44555469e-03, -8.54532197e-02, 4.30566669e-02, -4.90945578e-02,
1.08687999e-02, -3.58653292e-02, 3.19277793e-02, 1.70548886e-01,
7.04367757e-02, -1.03306666e-01, -6.25603348e-02, -4.16669573e-05,
-9.90156457e-03, 4.87144403e-02, -6.59128875e-02, 2.21944507e-03,
6.23853356e-02, -1.16886329e-02, -2.20711138e-02, 1.35971338e-01,
5.85511066e-02, -2.78507806e-02, -4.42699976e-02, 1.22686662e-01,
-4.96295579e-02, 8.47733300e-03, -1.72136649e-02, 3.73593345e-02,
1.38313353e-01, -1.81285888e-01, 8.07836726e-02, -1.01186670e-01,
1.90296680e-01, -8.37400090e-03, -4.79855575e-02, 4.62987460e-02,
4.97333193e-03, 1.08253332e-02, 1.37178123e-01, -4.36927788e-02,
-9.02644824e-03, 2.52826661e-02, -2.60283332e-02, 7.33327791e-02,
-4.21555527e-02, -9.45088938e-02, -2.36399993e-02, -2.59645544e-02,
-1.17972204e-02, -7.21249953e-02, -1.62978880e-02, 4.46572453e-02,
8.05888604e-03, 1.73073336e-02, -1.11245394e-01, -1.35631096e-02,
4.26412188e-02, -1.24742221e-02, -4.93782237e-02, -3.84650044e-02,
9.32500139e-03, -2.58344412e-02, 5.39288903e-03, -2.51024440e-02,
-1.68177821e-02, 1.81681886e-02, 6.95144460e-02, 5.96744493e-02,
1.28178876e-02, 8.18611085e-02, 2.03688871e-02, -1.45592675e-01,
-2.97091678e-02, 1.67966553e-03, 2.56901123e-02, -1.57507751e-02,
-3.29821557e-02, 3.69144455e-02, 2.69458871e-02, -7.87097737e-02,
-3.22544426e-02, 9.35557822e-04, 2.51506642e-02, -1.39920013e-02,
-5.63631117e-01, 1.28184333e-01, 8.25011209e-02, 4.69026715e-02,
-2.58401129e-02, 3.11454497e-02, 7.81277791e-02, -1.18433349e-02,
2.19431128e-02, 2.38199951e-03, -2.19482221e-02, 5.75609989e-02,
1.32304668e-01, 4.28974479e-02, -1.32128010e-02, 4.54772264e-02,
-9.00077820e-02, -7.34564438e-02, -8.14672261e-02, -5.10835573e-02,
-3.27358916e-02, 2.09213328e-02, 5.85612208e-02, -2.49340013e-02,
-1.03430830e-01, -1.28346771e-01, 4.52880040e-02, 5.96577907e-03,
1.12773672e-01, -3.90797779e-02, -5.79966642e-02, 4.97789842e-05,
2.49000057e-03, -2.88800001e-02, -9.96003374e-02, 3.41123343e-02,
-3.62301096e-02, -7.10571110e-02, -5.67906946e-02, 4.61289100e-03,
7.72120059e-02, -1.36105552e-01, -6.25717789e-02, -8.04037750e-02,
2.12122276e-02, -6.30133413e-03, -9.87700000e-02, 6.31399453e-02,
-8.64481106e-02, -4.26407792e-02, -8.36099982e-02, 1.07030040e-02,
-1.34339988e-01, 6.82333438e-03, 5.62012270e-02, 6.89233318e-02,
5.61566688e-02, -9.32652280e-02, 6.18273281e-02, 1.12723336e-01,
-1.04766667e-01, -2.15716790e-02, -1.15266666e-01, 4.57017794e-02,
7.47987852e-02, -9.02220607e-04, 7.75654465e-02, -2.66306698e-02,
1.93627775e-02, -4.89100069e-03, -1.43213451e-01, -6.52845576e-02,
1.64663326e-02, -5.07618897e-02, -1.49422223e-02, 4.21274304e-02,
1.06691113e-02, -5.97029589e-02, -1.20738111e-01, -1.61822215e-02,
-5.95551059e-02, 3.67141105e-02, 2.88833342e-02, 5.24356700e-02,
7.51844468e-03, -3.79579999e-02, 9.96864438e-02, 1.28289998e-01,
1.56755541e-02, -1.55926663e-02, -4.89732213e-02, 2.24273317e-02,
-9.15533304e-03, 7.32631087e-02, -7.48946667e-02, -1.15108885e-01,
-5.56773357e-02, -8.49866867e-03, -3.00188921e-02, 3.55113335e-02,
-4.22161110e-02, 7.19971135e-02, 3.67489979e-02, -1.00055551e-02,
7.52926618e-02, -1.43726662e-01, -4.08722041e-03, -1.49663329e-01,
1.41400262e-03, 5.52397817e-02, 8.86320025e-02, -7.44862184e-02,
-3.23222089e-03, 3.30205560e-02, 3.77681069e-02, 6.58650026e-02,
2.83081792e-02, -3.24210003e-02, 1.93070006e-02, 5.67157790e-02,
6.17166609e-02, 1.09540010e-02, 4.71896678e-02, 7.68444464e-02,
-2.51592230e-02, -4.28744499e-03, -2.40004435e-02, 3.28795537e-02,
1.25606894e-01, -6.05716556e-02, 5.52507788e-02, -2.12161113e-02,
-8.45399946e-02, -7.95067847e-02, -1.33965556e-02, -5.02544455e-02,
-3.03339995e-02, 1.19719980e-02, 6.15093298e-02, 1.11455554e-02,
1.24445252e-01, 5.54273315e-02, 1.28475904e-01, -9.19478834e-02,
-2.29498874e-02, -4.18815538e-02, 5.02915531e-02, -1.14721097e-02,
1.06602885e-01, -8.45602229e-02, -4.17976640e-02, 1.39088994e-02,
-2.19033333e-03, 7.99388885e-02, 1.08606648e-02, -1.27933361e-02,
-2.84678000e-03, -2.97433343e-02, -8.61347839e-02, 9.06177703e-03],
dtype=float32)
But when I run the following, I get:
In [3]: for token in doc: print("{} : {}".format(token, token.vector[:3]))
Hello : [0. 0. 0.]
Stack : [0. 0. 0.]
Over : [0. 0. 0.]
Flow : [0. 0. 0.]
, : [-0.082752 0.67204 -0.14987 ]
my : [ 0.08649 0.14503 -0.4902 ]
name : [ 0.23231 -0.024102 -0.83964 ]
is : [-0.084961 0.502 0.0023823]
Steve : [0. 0. 0.]
Why do I get different representations? Is the first vector a representation of the whole sentence?
Please explain why the token vectors differ from the document vector.

The solution is in the spaCy documentation: Doc.vector is "a real-valued meaning representation. Defaults to an average of the token vectors."
Source: https://spacy.io/api/doc#vector
I hope it will help others too.
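As a quick sanity check (a minimal sketch, assuming the same "en" model as above), you can verify that doc.vector equals the mean of the per-token vectors; tokens the model has no vector for contribute zeros to that average, which is why individual tokens such as "Hello" print as zero vectors while the document vector does not:
import numpy as np
import spacy
nlp = spacy.load("en")
doc = nlp('Hello Stack Over Flow, my name is Steve')
# doc.vector defaults to the average of the token vectors, so the two outputs
# above are consistent: out-of-vocabulary tokens simply contribute zero vectors.
token_vectors = np.array([token.vector for token in doc])
print(np.allclose(doc.vector, token_vectors.mean(axis=0)))  # expected: True
for token in doc:
    print(token.text, token.has_vector)  # False for the tokens shown as [0. 0. 0.]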

Related

For the Friedman test, what are the pros and cons of the different post hoc comparison methods?

scikit_posthocs lists several post hoc pairwise comparison methods to use when the Friedman test returns a significant result. How do I decide which one to use? In my case, the toy dataset df that I have is:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.array([[ 581.125,  366.25 ,  144.08 ,  296.575],
                   [1743.4  ,  900.85 ,  338.925,  481.6  ],
                   [4335.75 , 2000.1  ,  816.75 , 1236.75 ],
                   [ 241.925,  540.35 ,  249.9  ,  167.5  ],
                   [2822.4  , 1261.35 ,  547.025,  626.05 ],
                   [ 568.32 ,  652.55 ,  277.34 ,  265.68 ]]),
    columns=['Both', 'Media Advertisement', 'Neither', 'Store Events'],
    index=['Hardware Accessories', 'Pants', 'Shoes', 'Shorts', 'Sweatshirts',
           'T-Shirts']
)
I first ran:
from scipy.stats import friedmanchisquare
print(friedmanchisquare(*df.T.values))
This gives me
FriedmanchisquareResult(statistic=12.200000000000003, pvalue=0.006728522930461357)
Now I need to do the post hoc analysis, so I ran:
import scikit_posthocs as sp
sp.posthoc_conover_friedman(df).style.applymap(lambda x: "background-color: yellow" if x < 0.05 else "")
It gives me a table of pairwise p-values with values below 0.05 highlighted (image not reproduced here).
I am confused about which post hoc analysis approach I should choose to meet my analytic need. In the scikit-posthocs API, there are posthoc_nemenyi_friedman, posthoc_conover_friedman, posthoc_siegel_friedman and posthoc_miller_friedman to choose from. I tried them all and they returned conclusions that differ slightly from one another. Statistically speaking, which approach is appropriate for my case?
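To get a feel for how much the choice matters on this data, one thing you can do (a sketch, reusing the df defined above) is run all four Friedman post hoc functions side by side and compare the pairwise p-value matrices:
import scikit_posthocs as sp
# Compare the pairwise p-value matrices produced by the four Friedman
# post hoc methods mentioned above, all on the same wide-format data.
methods = {
    'nemenyi': sp.posthoc_nemenyi_friedman,
    'conover': sp.posthoc_conover_friedman,
    'siegel': sp.posthoc_siegel_friedman,
    'miller': sp.posthoc_miller_friedman,
}
for name, func in methods.items():
    print(name)
    print(func(df).round(4))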

Changing the values of a matrix changes the weights of the model

I am working with neural network weights and I am seeing a weird thing. I have written this code:
x = list(mnist_classifier.named_parameters())
weight = x[0][1].detach().cpu().numpy().squeeze()
print(weight)
So I get the following values:
[[[-0.2435195 0.05255396 -0.32765684]
[ 0.06372751 0.03564635 -0.31417745]
[ 0.14694464 -0.03277654 -0.10328879]]
[[-0.13716389 0.0128522 0.24107361]
[ 0.45231998 0.15497956 0.11112727]
[ 0.18206735 -0.22820294 -0.29146808]]
[[ 1.1747813 0.9206593 0.49848938]
[ 1.1558323 1.0859997 0.7743778 ]
[ 1.0287125 0.52122927 0.4096022 ]]
[[-0.2980809 -0.04358199 -0.26461622]
[-0.1165191 -0.2267315 0.37054354]
[ 0.4429275 0.44967037 0.06866694]]
[[ 0.39549246 0.10898255 0.32859102]
[-0.07753246 0.1628792 0.03021396]
[ 0.323148 0.5103844 0.16282919]]
....
Now, when I change the value of the first matrix weight[0] to 0.1, it changes the values of the original weights:
x = list(mnist_classifier.named_parameters())
weight = x[0][1].detach().cpu().numpy().squeeze()
weight[0] = weight[0] * 0 + 0.1
print(list(mnist_classifier.named_parameters()))
[('conv1.weight', Parameter containing:
tensor([[[[ 0.1000, 0.1000, 0.1000],
[ 0.1000, 0.1000, 0.1000],
[ 0.1000, 0.1000, 0.1000]]],
[[[-0.1372, 0.0129, 0.2411],
[ 0.4523, 0.1550, 0.1111],
[ 0.1821, -0.2282, -0.2915]]],
[[[ 1.1748, 0.9207, 0.4985],
[ 1.1558, 1.0860, 0.7744],
[ 1.0287, 0.5212, 0.4096]]],
...
What is going on here? How is weight[0] connected to the neural network?
I found the answer. When a tensor is converted to a NumPy array with .numpy(), the array shares memory with the tensor rather than being a copy, so modifying the array modifies the model's weights in place. Taking an explicit .copy() of the array gives an independent copy, so using copy() helped.
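For reference, here is a minimal sketch of what is going on (a fresh Conv2d layer stands in for the original mnist_classifier): on a CPU tensor, .numpy() returns an array that shares memory with the tensor, so writes to that array change the parameter unless you take a .copy() first.
import torch
# A small Conv2d layer stands in for mnist_classifier's first conv weight.
conv_weight = torch.nn.Conv2d(1, 5, kernel_size=3).weight
# On a CPU tensor, .cpu() is a no-op and .numpy() shares memory with the tensor.
shared = conv_weight.detach().cpu().numpy().squeeze()
independent = conv_weight.detach().cpu().numpy().squeeze().copy()  # real copy
shared[0] = shared[0] * 0 + 0.1   # writes straight through to the parameter
independent[1] = 0.1              # leaves the parameter untouched
print(conv_weight[0, 0])  # now all 0.1000
print(conv_weight[1, 0])  # unchanged
Note that if the model lived on the GPU, .cpu() would already produce a copy and the in-place writes would not reach the parameter.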

How do you compute the distance between text documents for k-means with word2vec?

I have recently been introduced to word2vec and I'm having some trouble trying to figure out how exactly it is used for k-means clustering.
I do understand how k-means works with tf-idf vectors. For each text document you have a vector of tf-idf values, and after choosing some documents as initial cluster centers, you can use the Euclidean distance to minimise the distances between the vectors of the documents. Here's an example.
However, when using word2vec, each word is represented as a vector. Does this mean that each document corresponds to a matrix? And if so, how do you compute the minimum distance w.r.t. other text documents?
Question: How do you compute the distance between text documents for k-means with word2vec?
Edit: To explain my confusion in a bit more detail, please consider the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())

model = Word2Vec(sentences_word2vec, min_count=1)
word2vec_matrix = model[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(0, len(word2vec_matrix)):
    print(word2vec_matrix[i])
It returns the following output:
[[ 0. 0.55459491 0. 0. 0.35399075 0. 0.
0. 0. 0. 0. 0. 0. 0.437249
0.35399075 0.35399075 0.35399075 0. ]
[ 0. 0. 0. 0.44302215 0.2827753 0. 0.
0. 0.34928375 0. 0. 0. 0.34928375
0. 0.2827753 0.5655506 0.2827753 0. ]
[ 0. 0. 0.35101741 0. 0. 0.27674616
0.35101741 0. 0. 0.35101741 0. 0.35101741
0.27674616 0.27674616 0.44809973 0. 0. 0.27674616]
[ 0.40531999 0. 0. 0. 0.2587105 0.31955894
0. 0.40531999 0.31955894 0. 0.40531999 0. 0.
0. 0. 0.2587105 0.2587105 0.31955894]]
20
[ 4.08335682e-03 -4.44161100e-03 3.92342824e-03 3.96498619e-03
6.99949533e-06 -2.14108804e-04 1.20419310e-03 -1.29191438e-03
1.64671184e-03 3.41688609e-03 -4.94929403e-03 2.90348311e-03
4.23802016e-03 -3.01274913e-03 -7.36164337e-04 3.47558968e-03
-7.02908786e-04 4.73567843e-03 -1.42914290e-03 3.17237526e-03
9.36070050e-04 -2.23833631e-04 -4.03443904e-04 4.97530040e-04
-4.82502300e-03 2.42140982e-03 -3.61089432e-03 3.37070058e-04
-2.09900597e-03 -1.82093668e-03 -4.74618562e-03 2.41499138e-03
-2.15628324e-03 3.43719614e-03 7.50159554e-04 -2.05973233e-03
1.92534993e-03 1.96503079e-03 -2.02400610e-03 3.99564439e-03
4.95056808e-03 1.47033704e-03 -2.80071306e-03 3.59585625e-04
-2.77896033e-04 -3.21732066e-03 4.36303904e-03 -2.16396619e-03
2.24438333e-03 -4.50925855e-03 -4.70488053e-03 6.30825118e-04
3.81869613e-03 3.75767215e-03 5.01064525e-04 1.70175335e-03
-1.26033701e-04 -7.43318116e-04 -6.74833194e-04 -4.76678275e-03
1.53754558e-03 2.32421421e-03 -3.23472451e-03 -8.32759659e-04
4.67014220e-03 5.15853462e-04 -1.15449808e-03 -1.63017167e-03
-2.73897988e-03 -3.95627553e-03 4.04657237e-03 -1.79282576e-03
-3.26930732e-03 2.85121426e-03 -2.33304151e-03 -2.01760884e-03
-3.33597139e-03 -1.19233003e-03 -2.12347694e-03 4.36858647e-03
2.00414215e-03 -4.23572073e-03 4.98410035e-03 1.79121632e-03
4.81655030e-03 3.33247939e-03 -3.95260006e-03 1.19335402e-03
4.61675343e-04 6.09758368e-04 -4.74696746e-03 4.91552567e-03
1.74517138e-03 2.36604619e-03 -3.06009664e-04 3.62954312e-03
3.56943789e-03 2.92139384e-03 -4.27138479e-03 -3.51175456e-03]
[ -4.14272398e-03 3.45513038e-03 -1.47538856e-04 -2.02292087e-03
-2.96578306e-04 1.88684417e-03 -2.63865804e-03 2.69249966e-03
4.57606697e-03 2.19206396e-03 2.01336667e-03 1.47434452e-03
1.88332598e-03 -1.14452699e-03 -1.35678309e-03 -2.02636060e-04
-3.26160830e-03 -3.95368552e-03 1.40415027e-03 2.30542314e-03
-3.18884710e-03 -4.46776347e-03 3.96415358e-03 -2.07852037e-03
4.98413946e-03 -6.43568579e-04 -2.53325375e-03 1.30117545e-03
1.26555841e-03 -8.84680718e-04 -8.34991166e-04 -4.15050285e-03
4.66807076e-04 1.71844949e-04 1.08140183e-03 4.37910948e-03
-3.28412466e-03 2.09890743e-04 2.29888223e-03 4.70223464e-03
-2.31004297e-03 -5.10134443e-04 2.57104915e-03 -2.55978899e-03
-7.55646848e-04 -1.98197929e-04 1.20443532e-04 4.63618943e-03
1.13036349e-05 8.16594984e-04 -1.65917678e-03 3.29331891e-03
-4.97825304e-03 -2.03667139e-03 3.60272871e-03 7.44500838e-04
-4.40325850e-04 6.38399797e-04 -4.23364760e-03 -4.56386572e-03
4.77551389e-03 4.74880403e-03 7.06148741e-04 -1.24937459e-03
-9.50689311e-04 -3.88551364e-03 -4.45985980e-03 -1.15060725e-03
3.27067473e-03 4.54987818e-03 2.62327422e-03 -2.40981602e-03
4.55576897e-04 3.19155119e-03 -3.84227419e-03 -1.17610034e-03
-1.45622855e-03 -4.32460709e-03 -4.12792247e-03 -1.74557802e-03
4.66075348e-04 3.39668151e-03 -4.00651991e-03 1.41077011e-03
-7.89384532e-04 -6.56061340e-04 1.14822399e-03 4.12205653e-03
3.60721885e-03 -3.11746349e-04 1.44255662e-03 3.11965472e-03
-4.93455213e-03 4.80490318e-03 2.79991422e-03 4.93505970e-03
3.69034940e-03 4.76422161e-03 -1.25827035e-03 -1.94680784e-03]
...
[ -3.92252317e-04 -3.66805331e-03 1.52376946e-03 -3.81564132e-05
-2.57118000e-03 -4.46725264e-03 2.36480637e-03 -4.70252614e-03
-4.18651942e-03 4.54758806e-03 4.38804098e-04 1.28351408e-03
3.40470579e-03 1.00038981e-03 -1.06557179e-03 4.67202952e-03
4.50591929e-03 -2.67829909e-03 2.57702312e-03 -3.65824508e-03
-4.54068230e-03 2.20785337e-03 -1.00554363e-03 5.14690124e-04
4.64830594e-03 1.91410910e-03 -4.83837258e-03 6.73376708e-05
-2.37796479e-03 -4.45193471e-03 -2.60163331e-03 1.51159777e-03
4.06868104e-03 2.55690538e-04 -2.54662265e-03 2.64597777e-03
-2.62586889e-03 -2.71554058e-03 5.49281889e-04 -1.38776843e-03
-2.94354092e-03 -1.13887887e-03 4.59292997e-03 -1.02300232e-03
2.27600057e-03 -4.88117011e-03 1.95790920e-03 4.64376673e-04
2.56658648e-03 8.90390365e-04 -1.40368659e-03 -6.40658545e-04
-3.53228673e-03 -1.30717538e-03 -1.80223631e-03 2.94505036e-03
-4.82233381e-03 -2.16079340e-03 2.58940039e-03 1.60595961e-03
-1.22245611e-03 -6.72614493e-04 4.47060820e-03 -4.95934719e-03
2.70283176e-03 2.93257344e-03 2.13279200e-04 2.59435410e-03
2.98801321e-03 -2.79974379e-03 -1.49789048e-04 -2.53924704e-03
-7.83207070e-04 1.18357304e-03 -1.27669750e-03 -4.16665291e-03
1.40916929e-03 1.63017987e-07 1.36708119e-03 -1.26687710e-05
1.24729215e-03 -2.50442210e-03 -3.20308795e-03 -1.41550787e-03
-1.05747324e-03 -3.97984264e-03 2.25877413e-03 -1.28316227e-03
3.60359484e-03 -1.97929185e-04 3.21712159e-03 -4.96298913e-03
-1.83640339e-03 -9.90608009e-04 -2.03964626e-03 -4.87274351e-03
7.24950165e-04 3.85614252e-03 -4.18979349e-03 2.73840013e-03]
Using tfidf, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(tfidf_matrix)
Using word2vec, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(word2vec_matrix)
(Here's an example of k-means with word2vec). So in the first case, k-means gets a matrix with the tf-idf values of each word per document, while in the second case k-means gets a vector for each word. How can k-means cluster the documents in the second case if it just has the word2vec representations?
Since you are interested in clustering documents, probably the best you can do is to use the Doc2Vec package, which can prepare a vector for each one of your documents. Then you can apply any clustering algorithm to the set of your document vectors for further processing. If, for any reason, you want to use word vectors instead, there are a few things you can do. Here is a very simple method:
1. For each document, collect the words with the highest TF-IDF values w.r.t. that document.
2. Average the Word2Vec vectors of those words to create a vector for the whole document.
3. Apply your clustering to the averaged vectors.
Don't try to average all the words in a document; it won't work.
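A minimal sketch of that recipe (toy sentences and the gensim 4 style API; the names here are illustrative, not from the question) could look like this:
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def document_vectors(docs, w2v, top_n=10):
    # One vector per document: average the Word2Vec vectors of the words
    # with the highest TF-IDF weights in that document.
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(docs).toarray()
    vocab = np.array(vec.get_feature_names_out())
    doc_vecs = []
    for row in tfidf:
        top_words = vocab[np.argsort(row)[::-1][:top_n]]
        word_vecs = [w2v.wv[w] for w in top_words if w in w2v.wv]
        doc_vecs.append(np.mean(word_vecs, axis=0) if word_vecs
                        else np.zeros(w2v.vector_size))
    return np.vstack(doc_vecs)

docs = ["the cat sat on the mat", "dogs chase cats all day",
        "stocks fell on monday", "markets rallied after the news"]
w2v = Word2Vec([d.split() for d in docs], vector_size=50, min_count=1)
X = document_vectors(docs, w2v)          # one row per document, not per word
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_)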

Random Forest feature importance: how many are actually used?

I use RF twice in a row.
First, I fit it using max_features='auto' and the whole dataset (109 features) in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; it correctly gives me one score for each of the 109 features:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then I transform the dataset using the previous fit, which should reduce its dimensionality, and re-fit the RF on the reduced data.
Given max_features='auto' and the 109 features, I would expect to end up with ~10 features in total; instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features when transforming the data.
It is the threshold argument of the transform method that determines which features are selected as the most important:
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
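To make the distinction concrete, here is a sketch on synthetic data (not the poster's dataset): the number of surviving features is governed by the selector's threshold, while max_features only limits how many features each split considers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=109, random_state=0)

# Every feature still gets an importance score, regardless of max_features.
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0).fit(X, y)
print(len(rf.feature_importances_))   # 109

# The selection step keeps the features whose importance is above `threshold`.
selector = SelectFromModel(rf, threshold='mean', prefit=True)
X_reduced = selector.transform(X)
print(X_reduced.shape[1])             # typically far more than sqrt(109) ~ 10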

Hierarchical Clustering with cosine similarity metric in fcluster package

I use scipy.cluster.hierarchy to do a hierarchical clustering on a set of points using the "cosine" similarity metric. As an example, I have:
import numpy as np
import scipy.cluster.hierarchy as hac
import matplotlib.pyplot as plt
Points = np.array([[ 0. , 0.23508573],
[ 0.00754775 , 0.26717266],
[ 0.00595464 , 0.27775905],
[ 0.01220563 , 0.23622067],
[ 0.00542628 , 0.14185873],
[ 0.03078922 , 0.11273108],
[ 0.06707743 ,-0.1061131 ],
[ 0.04411757 ,-0.10775407],
[ 0.01349434 , 0.00112159],
[ 0.04066034 , 0.11639591],
[ 0. , 0.29046682],
[ 0.07338036 , 0.00609912],
[ 0.01864988 , 0.0316196 ],
[ 0. , 0.07270636],
[ 0. , 0. ]])
z = hac.linkage(Points, metric='cosine', method='complete')
labels = hac.fcluster(z, 0.1, criterion="distance")
plt.scatter(Points[:, 0], Points[:, 1], c=labels.astype(float))
plt.show()
Since I use the cosine metric, in some cases the dot product of two vectors can be negative, or the norm of some vectors can be zero. This means the z output will have some negative or infinite elements, which is not valid for fcluster (as shown below):
z =
[[ 0.00000000e+00 1.00000000e+01 0.00000000e+00 2.00000000e+00]
[ 1.30000000e+01 1.50000000e+01 0.00000000e+00 3.00000000e+00]
[ 8.00000000e+00 1.10000000e+01 4.26658708e-13 2.00000000e+00]
[ 1.00000000e+00 2.00000000e+00 2.31748880e-05 2.00000000e+00]
[ 3.00000000e+00 4.00000000e+00 8.96700489e-05 2.00000000e+00]
[ 1.60000000e+01 1.80000000e+01 3.98805492e-04 5.00000000e+00]
[ 1.90000000e+01 2.00000000e+01 1.33225099e-03 7.00000000e+00]
[ 5.00000000e+00 9.00000000e+00 2.41120340e-03 2.00000000e+00]
[ 6.00000000e+00 7.00000000e+00 1.52914684e-02 2.00000000e+00]
[ 1.20000000e+01 2.20000000e+01 3.52441432e-02 3.00000000e+00]
[ 2.10000000e+01 2.40000000e+01 1.38662986e-01 1.00000000e+01]
[ 1.70000000e+01 2.30000000e+01 6.99056531e-01 4.00000000e+00]
[ 2.50000000e+01 2.60000000e+01 1.92543748e+00 1.40000000e+01]
[ -1.00000000e+00 2.70000000e+01 inf 1.50000000e+01]]
To solve this problem, I checked the linkage() function, and inside it I needed to look at the _hierarchy.linkage() method. I use the PyCharm editor, and when I asked for the "linkage" source code it opened a Python file named "_hierarchy.py" in a directory like the following:
.PyCharm40/system/python_stubs/-1247972723/scipy/cluster/_hierarchy.py
This Python file doesn't contain definitions for any of the included functions.
I am wondering where the correct source of this function is so I can revise it, or whether there is another way to solve this problem.
I would appreciate your help and hints.
You have a zero vector [0, 0] in your data set. For such data, cosine distance is undefined, so you are using an inappropriate distance function!
This is a definition gap that cannot be trivially closed: inf is as incorrect as 0, because the cosine distance to [0, 0] cannot be defined without contradictions. You must not use cosine on such data.
Back to your actual question: _hierarchy is a Cython module. It is not pure Python; it is compiled to native code. You can easily see the source code on GitHub:
https://github.com/scipy/scipy/blob/master/scipy/cluster/_hierarchy.pyx
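If you still want to cluster these points hierarchically, one way to follow that advice (a sketch using a small synthetic array with a zero row, standing in for the Points above) is to drop the all-zero rows before computing the cosine linkage, so no undefined distances ever reach fcluster:
import numpy as np
import scipy.cluster.hierarchy as hac

Points = np.array([[0.0, 0.23508573],
                   [0.00754775, 0.26717266],
                   [0.06707743, -0.1061131],
                   [0.0, 0.0]])               # the problematic zero vector

nonzero = Points[np.linalg.norm(Points, axis=1) > 0]   # drop rows where cosine is undefined
z = hac.linkage(nonzero, metric='cosine', method='complete')
labels = hac.fcluster(z, 0.1, criterion='distance')
print(z)        # no inf or negative entries now
print(labels)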
