How do you compute the distance between text documents for k-means with word2vec? - nlp

I have recently been introduced to word2vec and I'm having some trouble figuring out how exactly it is used for k-means clustering.
I do understand how k-means works with tf-idf vectors. Each text document has a vector of tf-idf values, and after choosing some documents as initial cluster centers, you can use the Euclidean distance to minimise the distances between the document vectors and their cluster centers. Here's an example.
However, when using word2vec, each word is represented as a vector. Does this mean that each document corresponds to a matrix? And if so, how do you compute the minimum distance w.r.t. other text documents?
Question: How do you compute the distance between text documents for k-means with word2vec?
Edit: To explain my confusion in a bit more detail, please consider the following code:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())
model = Word2Vec(sentences_word2vec, min_count=1)
word2vec_matrix = model[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(len(word2vec_matrix)):
    print(word2vec_matrix[i])
It returns the following output:
[[ 0. 0.55459491 0. 0. 0.35399075 0. 0.
0. 0. 0. 0. 0. 0. 0.437249
0.35399075 0.35399075 0.35399075 0. ]
[ 0. 0. 0. 0.44302215 0.2827753 0. 0.
0. 0.34928375 0. 0. 0. 0.34928375
0. 0.2827753 0.5655506 0.2827753 0. ]
[ 0. 0. 0.35101741 0. 0. 0.27674616
0.35101741 0. 0. 0.35101741 0. 0.35101741
0.27674616 0.27674616 0.44809973 0. 0. 0.27674616]
[ 0.40531999 0. 0. 0. 0.2587105 0.31955894
0. 0.40531999 0.31955894 0. 0.40531999 0. 0.
0. 0. 0.2587105 0.2587105 0.31955894]]
20
[ 4.08335682e-03 -4.44161100e-03 3.92342824e-03 3.96498619e-03
  ...
  3.56943789e-03 2.92139384e-03 -4.27138479e-03 -3.51175456e-03]
[ -4.14272398e-03 3.45513038e-03 -1.47538856e-04 -2.02292087e-03
  ...
  3.69034940e-03 4.76422161e-03 -1.25827035e-03 -1.94680784e-03]
...
[ -3.92252317e-04 -3.66805331e-03 1.52376946e-03 -3.81564132e-05
  ...
  7.24950165e-04 3.85614252e-03 -4.18979349e-03 2.73840013e-03]
Using tfidf, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(tfidf_matrix)
Using word2vec, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(word2vec_matrix)
(Here's an example of k-means with word2vec.) So in the first case, k-means gets a matrix with one row per document (the tf-idf values of each word in that document), while in the second case k-means gets one vector per word. How can k-means cluster the documents in the second case if all it has is word-level word2vec representations?

Since you are interested in clustering documents, probably the best you can do is to use Doc2Vec, which produces one vector per document. Then you can apply any clustering algorithm to the set of document vectors for further processing. If, for any reason, you want to use word vectors instead, there are a few things you can do. Here is a very simple method:
1. For each document, collect the words with the highest tf-idf values w.r.t. that document.
2. Average the word2vec vectors of those words to create a vector for the whole document.
3. Apply your clustering to the averaged vectors.
Don't try to average all the words in a document; it won't work.
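The averaging step can be sketched with plain NumPy; the embedding table and per-document word lists below are hypothetical stand-ins for a trained Word2Vec model and the tf-idf selection:

```python
import numpy as np

# Hypothetical word vectors (in practice, look them up in the trained model)
embeddings = {
    "cat":  np.array([1.0, 0.0, 0.0, 0.0]),
    "dog":  np.array([0.0, 1.0, 0.0, 0.0]),
    "car":  np.array([0.0, 0.0, 1.0, 0.0]),
    "road": np.array([0.0, 0.0, 0.0, 1.0]),
}

# Top tf-idf words already selected per document (step 1)
docs_top_words = [["cat", "dog"], ["car", "road"]]

# Step 2: one averaged vector per document
doc_matrix = np.array([
    np.mean([embeddings[w] for w in words], axis=0)
    for words in docs_top_words
])
print(doc_matrix.shape)  # (2, 4): one row per document
```

`doc_matrix` now has the same shape contract as the tf-idf matrix (one row per document), so step 3 is just `KMeans(n_clusters=...).fit(doc_matrix)`.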

Related

Difference in output of spacy nlp .vector when applied on sentence?

I am doing the following:
import spacy
nlp = spacy.load("en")
doc = nlp('Hello Stack Over Flow, my name is Steve')
Then inspecting doc.vector:
In [1]: doc = nlp('Hello Stack Over Flow, my name is Steve')
In [2]: doc.vector
Out[2]:
array([ 1.67874452e-02, 1.43885329e-01, -1.64147541e-01, -3.52525562e-02,
        ...
       -2.84678000e-03, -2.97433343e-02, -8.61347839e-02, 9.06177703e-03],
      dtype=float32)
But when I run the following I get:
In [3]: for token in doc: print("{} : {}".format(token, token.vector[:3]))
Hello : [0. 0. 0.]
Stack : [0. 0. 0.]
Over : [0. 0. 0.]
Flow : [0. 0. 0.]
, : [-0.082752 0.67204 -0.14987 ]
my : [ 0.08649 0.14503 -0.4902 ]
name : [ 0.23231 -0.024102 -0.83964 ]
is : [-0.084961 0.502 0.0023823]
Steve : [0. 0. 0.]
Why do I get different representations? Is the first vector the representation of the whole sentence?
From the spaCy documentation, Doc.vector is "A real-valued meaning representation. Defaults to an average of the token vectors." So the first output is the representation of the whole sentence: the average of the individual token vectors.
Source: https://spacy.io/api/doc#vector
Hope it will help others too.
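The averaging behaviour is easy to verify by hand. Here is a NumPy sketch with made-up 3-dimensional token vectors; tokens missing from the model's vocabulary contribute all-zero rows, which is why `Hello`, `Stack`, etc. print as `[0. 0. 0.]` yet still pull the average toward zero:

```python
import numpy as np

token_vectors = np.array([
    [0.0, 0.0, 0.0],   # out-of-vocabulary token -> zero vector
    [0.2, 0.4, -0.6],
    [-0.1, 0.3, 0.5],
])

# Doc.vector defaults to the average of the token vectors
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)
```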

How does sklearn.linear_model.LinearRegression work with insufficient data?

To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8 parameter model (7 slopes and 1 intercept).
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assumption to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an under-specified model?
You don't need to reduce the degrees of freedom; it simply finds a solution to the least squares problem min sum_i (dot(beta, x_i) + beta_0 - y_i)**2. For example, in the non-sparse case it uses linalg.lstsq from scipy, whose default solver is the gelsd LAPACK driver. If
A = np.concatenate((ones_v, X), axis=1)
is the augmented array with ones as its first column, then your solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually evaluate this formula; it uses a singular value decomposition (SVD) of A instead.
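A quick numerical check of the above, with random underdetermined data (6 samples, 8 coefficients including the intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 7))   # 6 samples, 7 features: underdetermined
y = rng.normal(size=6)
A = np.concatenate((np.ones((6, 1)), X), axis=1)  # intercept column of ones

# Minimum-norm least-squares solution, three equivalent routes:
beta_formula = np.linalg.pinv(A.T @ A) @ A.T @ y    # the formula above
beta_pinv = np.linalg.pinv(A) @ y                   # pseudoinverse of A directly
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)  # SVD-based solver

print(np.allclose(beta_formula, beta_pinv), np.allclose(beta_pinv, beta_lstsq))
print(np.allclose(A @ beta_lstsq, y))  # residual is zero: the fit is exact
```

All three agree because, with fewer equations than unknowns, the least-squares problem has infinitely many exact solutions, and each of these routes picks out the one with minimum norm.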

Why do mllib word2vec word vectors only have 100 elements?

I have a word2vec model that I created in PySpark. The model is saved as a .parquet file. I want to be able to access and query the model (or the words and word vectors) using vanilla Python because I am building a flask app that will allow a user to enter words of interest for finding synonyms.
I've extracted the words and word vectors, but I've noticed that while I have approximately 7000 unique words, my word vectors have a length of 100. For example, here are two words "serious" and "breaks". Their vectors only have a length of 100. Why is this? How is it able to then reconstruct the entire vector space with only 100 values for each word? Is it simply only giving me the top 100 or the first 100 values?
vectors.take(2)
Out[48]:
[Row(word=u'serious', vector=DenseVector([0.0784, -0.0882, -0.0342, -0.0153, 0.0223, 0.1034, 0.1218, -0.0814, -0.0198, -0.0325, -0.1024, -0.2412, -0.0704, -0.1575, 0.0342, -0.1447, -0.1687, 0.0673, 0.1248, 0.0623, -0.0078, -0.0813, 0.0953, -0.0213, 0.0031, 0.0773, -0.0246, -0.0822, -0.0252, -0.0274, -0.0288, 0.0403, -0.0419, -0.1122, -0.0397, 0.0186, -0.0038, 0.1279, -0.0123, 0.0091, 0.0065, 0.0884, 0.0899, -0.0479, 0.0328, 0.0171, -0.0962, 0.0753, -0.187, 0.034, -0.1393, -0.0575, -0.019, 0.0151, -0.0205, 0.0667, 0.0762, -0.0365, -0.025, -0.184, -0.0118, -0.0964, 0.1744, 0.0563, -0.0413, -0.054, -0.1764, -0.087, 0.0747, -0.022, 0.0778, -0.0014, -0.1313, -0.1133, -0.0669, 0.0007, -0.0378, -0.1093, -0.0732, 0.1494, -0.0815, -0.0137, 0.1009, -0.0057, 0.0195, 0.0085, 0.025, 0.0064, 0.0076, 0.0676, 0.1663, -0.0078, 0.0278, 0.0519, -0.0615, -0.0833, 0.0643, 0.0032, -0.0882, 0.1033])),
Row(word=u'breaks', vector=DenseVector([0.0065, 0.0027, -0.0121, 0.0296, -0.0467, 0.0297, 0.0499, 0.0843, 0.1027, 0.0179, -0.014, 0.0586, 0.06, 0.0534, 0.0391, -0.0098, -0.0266, -0.0422, 0.0188, 0.0065, -0.0309, 0.0038, -0.0458, -0.0252, 0.0428, 0.0046, -0.065, -0.0822, -0.0555, -0.0248, -0.0288, -0.0016, 0.0334, -0.0028, -0.0718, -0.0571, -0.0668, -0.0073, 0.0658, -0.0732, 0.0976, -0.0255, -0.0712, 0.0899, 0.0065, -0.04, 0.0964, 0.0356, 0.0142, 0.0857, 0.0669, -0.038, -0.0728, -0.0446, 0.1194, -0.056, 0.1022, 0.0459, -0.0343, -0.0861, -0.0943, -0.0435, -0.0573, 0.0229, 0.0368, 0.085, -0.0218, -0.0623, 0.0502, -0.0645, 0.0247, -0.0371, -0.0785, 0.0371, -0.0047, 0.0012, 0.0214, 0.0669, 0.049, -0.0294, -0.0272, 0.0642, -0.006, -0.0804, -0.06, 0.0719, -0.0109, -0.0272, -0.0366, 0.0041, 0.0556, 0.0108, 0.0624, 0.0134, -0.0094, 0.0219, 0.0164, -0.0545, -0.0055, -0.0193]))]
Any thoughts on the best way to reconstruct this model in vanilla python?
Just to expand on the comment by zero323, for anyone else who arrives here:
Word2Vec creates word vectors of 100 dimensions by default; those 100 values are the complete embedding for each word, not a truncation of something larger. To change this, set the vector size when initializing the model, e.g.
model = Word2Vec(sentences, size=300)
in gensim (PySpark's pyspark.ml.feature.Word2Vec takes vectorSize instead), which will create vectors of 300 dimensions.
I think the problem lies with your minCount parameter value for the Word2Vec model.
If this value is too high, fewer words get used in training the model, resulting in a smaller vocabulary.
7000 unique words is not a lot.
Try setting minCount lower than the default of 5:
model.setMinCount(value)
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=word2vec#pyspark.ml.feature.Word2Vec
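For the "query in vanilla Python" part of the question: once the (word, vector) pairs are extracted from the parquet file, a plain dict plus cosine similarity is all a Flask app needs. A sketch with tiny hypothetical vectors (real ones would be the 100-dimensional rows from the model):

```python
import numpy as np

# Hypothetical extracted vectors keyed by word
vocab = {
    "serious": np.array([0.9, 0.1, 0.0]),
    "grave":   np.array([0.8, 0.2, 0.1]),
    "breaks":  np.array([0.0, 0.1, 0.9]),
}

def most_similar(word, vocab, topn=1):
    """Rank the other words by cosine similarity to `word`."""
    v = vocab[word]
    sims = {
        w: float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        for w, u in vocab.items() if w != word
    }
    return sorted(sims, key=sims.get, reverse=True)[:topn]

print(most_similar("serious", vocab))  # ['grave']
```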

Random Forest feature importance: how many are actually used?

I use RF twice in a row.
First, I fit it using max_features='auto' and the whole dataset (109 features), in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; it correctly gives me one score per each of the 109 features:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then, I transform the dataset using the previous fit, which should reduce its dimensionality, and re-fit the RF on the result.
Given max_features='auto' and the 109 features, I would expect to end up with ~10 features in total; instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
"The number of features to consider when looking for the best split"
It is not the number of features kept when transforming the data.
What determines which features are kept is the threshold parameter of the transform method:
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
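The default behaviour can be reproduced directly; with hypothetical importances, transform keeps exactly the features whose importance is at least the mean:

```python
import numpy as np

# Hypothetical feature importances from a fitted forest (they sum to 1)
importances = np.array([0.30, 0.05, 0.20, 0.01, 0.25, 0.19])

# Default threshold is the mean; max_features plays no role here
threshold = importances.mean()
kept = np.where(importances >= threshold)[0]
print(kept)  # [0 2 4 5]
```

With 109 roughly comparable importances, around half land above the mean, which is why 62 features survived rather than ~10.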

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0, 1]. However, it sometimes gives me an array whose first number is close to 2. Is it a bug, or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used directly, for example when calculating the area under the curve or plotting the false positive rate against the true positive rate.
Yet to plot what looks like a reasonable curve, one needs a threshold at which zero instances are predicted positive. Since scikit-learn's roc_curve need not receive normalised probabilities as scores (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf would be sensible, but coders often expect finite data (and the implementation may also need to work for integer thresholds). Instead the implementation uses max(score) + epsilon, where epsilon = 1. This may be cosmetically unsatisfying, but you haven't given any reason why it's actually a problem!
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
This seems like a bug to me: in roc_curve(aa, bb), 1 is added to the first threshold. You should create an issue here: https://github.com/scikit-learn/scikit-learn/issues
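The role of that extra threshold is easy to see by computing the ROC points by hand (pure NumPy, toy data):

```python
import numpy as np

y_true = np.array([1, 0, 1, 0, 1])
scores = np.array([0.9, 0.8, 0.6, 0.3, 0.2])

# Candidate thresholds: each distinct score, descending, with max+1 prepended
thresholds = np.r_[scores.max() + 1, np.sort(np.unique(scores))[::-1]]

points = []
for t in thresholds:
    pred = scores >= t
    tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
    points.append((float(fpr), float(tpr)))

print(points[0])   # (0.0, 0.0): the synthetic max+1 threshold anchors the curve
print(points[-1])  # (1.0, 1.0): everything predicted positive
```

Without the max + 1 entry, the curve would start at the first real threshold and never pass through (0, 0).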
