Random Forest feature importance: how many are actually used? - scikit-learn

I use RF twice in a row.
First, I fit it with max_features='auto' on the whole dataset (109 features) in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; as expected, it gives me one score for each of the 109 features:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then, I transform the dataset with the previous fit, which should reduce its dimensionality, and I re-fit RF over it.
Given max_features='auto' and the 109 features, I would expect to end up with ~10 features in total. Instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?

You have misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data.
What determines which features survive the transform method is the threshold parameter:
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
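In modern scikit-learn this transform step lives in SelectFromModel rather than on the forest itself (the estimator's own transform method was later removed). A minimal sketch on toy data (not the asker's 109 features), showing that the number of surviving features is set by threshold, not by max_features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data standing in for the asker's 109-feature set
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep only features whose importance is >= the mean importance;
# `threshold` (not max_features) controls how many columns survive
selector = SelectFromModel(rf, threshold="mean", prefit=True)
X_reduced = selector.transform(X)

print(X.shape, "->", X_reduced.shape)  # fewer than 20 columns remain
```

Passing threshold="median" or a string like "1.25*mean" changes the cut-off, and with it the number of features kept, exactly as the documentation quote above describes.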

Related

For the Friedman test, what are the pros and cons of the different post hoc comparison methods?

scikit_posthocs lists several post hoc pairwise comparison methods to use when the Friedman test gives a significant result. How do I decide which one to use? In my case, the toy dataset df that I have is:
df = pd.DataFrame(
    data=np.array([[ 581.125,  366.25 ,  144.08 ,  296.575],
                   [1743.4  ,  900.85 ,  338.925,  481.6  ],
                   [4335.75 , 2000.1  ,  816.75 , 1236.75 ],
                   [ 241.925,  540.35 ,  249.9  ,  167.5  ],
                   [2822.4  , 1261.35 ,  547.025,  626.05 ],
                   [ 568.32 ,  652.55 ,  277.34 ,  265.68 ]]),
    columns=['Both', 'Media Advertisement', 'Neither', 'Store Events'],
    index=['Hardware Accessories', 'Pants', 'Shoes', 'Shorts',
           'Sweatshirts', 'T-Shirts']
)
I first ran:
from scipy.stats import friedmanchisquare
print(friedmanchisquare(*df.T.values))
This gives me
FriedmanchisquareResult(statistic=12.200000000000003, pvalue=0.006728522930461357)
Now I need to do post hoc analysis. I first ran:
import scikit_posthocs as sp
sp.posthoc_conover_friedman(q1_friedman).style.applymap(lambda x: "background-color: yellow" if x < 0.05 else "")
It gives me a table of pairwise p-values, with the significant cells highlighted.
I am confused about which post hoc approach I should choose to meet my analytic needs. In the post hocs API, there are posthoc_nemenyi_friedman, posthoc_conover_friedman, posthoc_siegel_friedman and posthoc_miller_friedman to choose from. I tried them all and they returned conclusions that differ slightly from one another. Statistically speaking, which approach is appropriate for my case?

reg.coef_ returning same value twice

I created a linear regression model in which I tried to find the weights (coefficients) and bias (y-intercept) by running the following code:
reg.intercept_
reg.coef_
Output:
array([9.41523946, 9.41523946])
array([[-0.44871341, 0.20903483, 0.0142496 , 0.01288174, -0.14055166,
-0.17990912, -0.06054988, -0.08992433, -0.1454692 , -0.10144383,
-0.20062984, -0.12988747, -0.16859669, -0.12149035, -0.03336798,
-0.14690868, 0.32047333],
[-0.44871341, 0.20903483, 0.0142496 , 0.01288174, -0.14055166,
-0.17990912, -0.06054988, -0.08992433, -0.1454692 , -0.10144383,
-0.20062984, -0.12988747, -0.16859669, -0.12149035, -0.03336798,
-0.14690868, 0.32047333]])
I am getting the same values twice rather than just once, and hence having difficulty summarizing the weights.
It could be that your model was fit with a two-column target containing the same column twice; coef_ has one row per target column.
Can you provide a sample of your dataset?
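The symptom in the question (two identical rows in coef_ and two identical intercepts) is exactly what you get when fit receives a 2-D target with a duplicated column. A small sketch on toy data (not the asker's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

# A 2-D target with the same column twice is treated as two outputs,
# so fit() returns one (identical) coefficient row per output
Y = np.column_stack([y, y])
reg = LinearRegression().fit(X, Y)

print(reg.intercept_)   # two identical intercepts
print(reg.coef_.shape)  # (2, 3): one row per target column
```

Passing the 1-D y instead of the stacked Y would yield a single coefficient row and a scalar intercept.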

How does sklearn.linear_model.LinearRegression work with insufficient data?

To solve a 5-parameter model, I need at least 5 data points to get a unique solution. For the x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8-parameter model (7 slopes and 1 intercept).
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of constraint to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an under-specified model?
You don't need to reduce the degrees of freedom; it simply finds a solution to the least-squares problem min sum_i (dot(beta, x_i) + beta_0 - y_i)**2. For example, in the non-sparse case it uses scipy's linalg.lstsq, whose default solver is the gelsd LAPACK driver. If
A = np.concatenate((ones_v, X), axis=1)
is the array augmented with ones as its first column, then your solution is given by
x = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually apply this formula directly; it uses the singular value decomposition of A to evaluate it stably.
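This is easy to check numerically: on an underdetermined system, the pseudoinverse solution and NumPy's LAPACK-backed lstsq agree on the same minimum-norm answer (toy data below, not the asker's):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))   # 4 samples, 6 features: underdetermined
y = rng.normal(size=4)

# Augment with a column of ones for the intercept
A = np.concatenate((np.ones((4, 1)), X), axis=1)

# Minimum-norm least-squares solution via the pseudoinverse...
beta_pinv = np.linalg.pinv(A) @ y

# ...matches the SVD-based LAPACK solver used under the hood
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(beta_pinv, beta_lstsq))  # True
```

Because there are more parameters than samples, the fit is exact (zero residual); of all the exactly-fitting parameter vectors, this solver returns the one with the smallest Euclidean norm.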

How do you compute the distance between text documents for k-means with word2vec?

I have recently been introduced to word2vec and I'm having some trouble trying to figure out how exactly it is used for k-means clustering.
I do understand how k-means works with tf-idf vectors. For each text document you have a vector of tf-idf values, and after choosing some documents as initial cluster centers, you can use the Euclidean distance to minimise the distances between the document vectors. Here's an example.
However, when using word2vec, each word is represented as a vector. Does this mean that each document corresponds to a matrix? And if so, how do you compute the minimum distance w.r.t. other text documents?
Question: How do you compute the distance between text documents for k-means with word2vec?
Edit: To explain my confusion in a bit more detail, please consider the following code:
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences_tfidf)
print(tfidf_matrix.toarray())

model = Word2Vec(sentences_word2vec, min_count=1)
word2vec_matrix = model[model.wv.vocab]
print(len(word2vec_matrix))
for i in range(0, len(word2vec_matrix)):
    print(word2vec_matrix[i])
It returns the following output:
[[ 0. 0.55459491 0. 0. 0.35399075 0. 0.
0. 0. 0. 0. 0. 0. 0.437249
0.35399075 0.35399075 0.35399075 0. ]
[ 0. 0. 0. 0.44302215 0.2827753 0. 0.
0. 0.34928375 0. 0. 0. 0.34928375
0. 0.2827753 0.5655506 0.2827753 0. ]
[ 0. 0. 0.35101741 0. 0. 0.27674616
0.35101741 0. 0. 0.35101741 0. 0.35101741
0.27674616 0.27674616 0.44809973 0. 0. 0.27674616]
[ 0.40531999 0. 0. 0. 0.2587105 0.31955894
0. 0.40531999 0.31955894 0. 0.40531999 0. 0.
0. 0. 0.2587105 0.2587105 0.31955894]]
20
[ 4.08335682e-03 -4.44161100e-03 3.92342824e-03 3.96498619e-03
6.99949533e-06 -2.14108804e-04 1.20419310e-03 -1.29191438e-03
1.64671184e-03 3.41688609e-03 -4.94929403e-03 2.90348311e-03
4.23802016e-03 -3.01274913e-03 -7.36164337e-04 3.47558968e-03
-7.02908786e-04 4.73567843e-03 -1.42914290e-03 3.17237526e-03
9.36070050e-04 -2.23833631e-04 -4.03443904e-04 4.97530040e-04
-4.82502300e-03 2.42140982e-03 -3.61089432e-03 3.37070058e-04
-2.09900597e-03 -1.82093668e-03 -4.74618562e-03 2.41499138e-03
-2.15628324e-03 3.43719614e-03 7.50159554e-04 -2.05973233e-03
1.92534993e-03 1.96503079e-03 -2.02400610e-03 3.99564439e-03
4.95056808e-03 1.47033704e-03 -2.80071306e-03 3.59585625e-04
-2.77896033e-04 -3.21732066e-03 4.36303904e-03 -2.16396619e-03
2.24438333e-03 -4.50925855e-03 -4.70488053e-03 6.30825118e-04
3.81869613e-03 3.75767215e-03 5.01064525e-04 1.70175335e-03
-1.26033701e-04 -7.43318116e-04 -6.74833194e-04 -4.76678275e-03
1.53754558e-03 2.32421421e-03 -3.23472451e-03 -8.32759659e-04
4.67014220e-03 5.15853462e-04 -1.15449808e-03 -1.63017167e-03
-2.73897988e-03 -3.95627553e-03 4.04657237e-03 -1.79282576e-03
-3.26930732e-03 2.85121426e-03 -2.33304151e-03 -2.01760884e-03
-3.33597139e-03 -1.19233003e-03 -2.12347694e-03 4.36858647e-03
2.00414215e-03 -4.23572073e-03 4.98410035e-03 1.79121632e-03
4.81655030e-03 3.33247939e-03 -3.95260006e-03 1.19335402e-03
4.61675343e-04 6.09758368e-04 -4.74696746e-03 4.91552567e-03
1.74517138e-03 2.36604619e-03 -3.06009664e-04 3.62954312e-03
3.56943789e-03 2.92139384e-03 -4.27138479e-03 -3.51175456e-03]
[ -4.14272398e-03 3.45513038e-03 -1.47538856e-04 -2.02292087e-03
-2.96578306e-04 1.88684417e-03 -2.63865804e-03 2.69249966e-03
4.57606697e-03 2.19206396e-03 2.01336667e-03 1.47434452e-03
1.88332598e-03 -1.14452699e-03 -1.35678309e-03 -2.02636060e-04
-3.26160830e-03 -3.95368552e-03 1.40415027e-03 2.30542314e-03
-3.18884710e-03 -4.46776347e-03 3.96415358e-03 -2.07852037e-03
4.98413946e-03 -6.43568579e-04 -2.53325375e-03 1.30117545e-03
1.26555841e-03 -8.84680718e-04 -8.34991166e-04 -4.15050285e-03
4.66807076e-04 1.71844949e-04 1.08140183e-03 4.37910948e-03
-3.28412466e-03 2.09890743e-04 2.29888223e-03 4.70223464e-03
-2.31004297e-03 -5.10134443e-04 2.57104915e-03 -2.55978899e-03
-7.55646848e-04 -1.98197929e-04 1.20443532e-04 4.63618943e-03
1.13036349e-05 8.16594984e-04 -1.65917678e-03 3.29331891e-03
-4.97825304e-03 -2.03667139e-03 3.60272871e-03 7.44500838e-04
-4.40325850e-04 6.38399797e-04 -4.23364760e-03 -4.56386572e-03
4.77551389e-03 4.74880403e-03 7.06148741e-04 -1.24937459e-03
-9.50689311e-04 -3.88551364e-03 -4.45985980e-03 -1.15060725e-03
3.27067473e-03 4.54987818e-03 2.62327422e-03 -2.40981602e-03
4.55576897e-04 3.19155119e-03 -3.84227419e-03 -1.17610034e-03
-1.45622855e-03 -4.32460709e-03 -4.12792247e-03 -1.74557802e-03
4.66075348e-04 3.39668151e-03 -4.00651991e-03 1.41077011e-03
-7.89384532e-04 -6.56061340e-04 1.14822399e-03 4.12205653e-03
3.60721885e-03 -3.11746349e-04 1.44255662e-03 3.11965472e-03
-4.93455213e-03 4.80490318e-03 2.79991422e-03 4.93505970e-03
3.69034940e-03 4.76422161e-03 -1.25827035e-03 -1.94680784e-03]
...
[ -3.92252317e-04 -3.66805331e-03 1.52376946e-03 -3.81564132e-05
-2.57118000e-03 -4.46725264e-03 2.36480637e-03 -4.70252614e-03
-4.18651942e-03 4.54758806e-03 4.38804098e-04 1.28351408e-03
3.40470579e-03 1.00038981e-03 -1.06557179e-03 4.67202952e-03
4.50591929e-03 -2.67829909e-03 2.57702312e-03 -3.65824508e-03
-4.54068230e-03 2.20785337e-03 -1.00554363e-03 5.14690124e-04
4.64830594e-03 1.91410910e-03 -4.83837258e-03 6.73376708e-05
-2.37796479e-03 -4.45193471e-03 -2.60163331e-03 1.51159777e-03
4.06868104e-03 2.55690538e-04 -2.54662265e-03 2.64597777e-03
-2.62586889e-03 -2.71554058e-03 5.49281889e-04 -1.38776843e-03
-2.94354092e-03 -1.13887887e-03 4.59292997e-03 -1.02300232e-03
2.27600057e-03 -4.88117011e-03 1.95790920e-03 4.64376673e-04
2.56658648e-03 8.90390365e-04 -1.40368659e-03 -6.40658545e-04
-3.53228673e-03 -1.30717538e-03 -1.80223631e-03 2.94505036e-03
-4.82233381e-03 -2.16079340e-03 2.58940039e-03 1.60595961e-03
-1.22245611e-03 -6.72614493e-04 4.47060820e-03 -4.95934719e-03
2.70283176e-03 2.93257344e-03 2.13279200e-04 2.59435410e-03
2.98801321e-03 -2.79974379e-03 -1.49789048e-04 -2.53924704e-03
-7.83207070e-04 1.18357304e-03 -1.27669750e-03 -4.16665291e-03
1.40916929e-03 1.63017987e-07 1.36708119e-03 -1.26687710e-05
1.24729215e-03 -2.50442210e-03 -3.20308795e-03 -1.41550787e-03
-1.05747324e-03 -3.97984264e-03 2.25877413e-03 -1.28316227e-03
3.60359484e-03 -1.97929185e-04 3.21712159e-03 -4.96298913e-03
-1.83640339e-03 -9.90608009e-04 -2.03964626e-03 -4.87274351e-03
7.24950165e-04 3.85614252e-03 -4.18979349e-03 2.73840013e-03]
Using tfidf, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(tfidf_matrix)
Using word2vec, k-means would be implemented by the lines
kmeans = KMeans(n_clusters = 5)
kmeans.fit(word2vec_matrix)
(Here's an example of k-means with word2vec.) So in the first case, k-means gets a matrix with the tf-idf values of each word per document, while in the second case it gets one vector per word. How can k-means cluster the documents in the second case if all it has is individual word representations?
Since you are interested in clustering documents, probably the best you can do is to use the Doc2Vec package, which produces a vector for each of your documents. Then you can apply any clustering algorithm to the set of document vectors for further processing. If, for any reason, you want to use word vectors instead, there are a few things you can do. Here is a very simple method:
For each document, collect the words with the highest TF-IDF values w.r.t. that document.
Average the Word2Vec vectors of those words to create a vector for the whole document.
Apply your clustering to the averaged vectors.
Don't try to average all the words in a document; it won't work.
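The averaging step above can be sketched with plain NumPy. Here word_vecs is a hypothetical stand-in for a trained Word2Vec lookup, and the top-TF-IDF words per document are hard-coded for the sketch:

```python
import numpy as np

# Hypothetical word vectors standing in for a trained Word2Vec model
word_vecs = {
    "cat":  np.array([1.0, 0.0]),
    "dog":  np.array([0.9, 0.1]),
    "car":  np.array([0.0, 1.0]),
    "road": np.array([0.1, 0.9]),
}

# Top-TF-IDF words per document (hard-coded for this sketch)
docs_top_words = [["cat", "dog"], ["car", "road"], ["dog", "cat"]]

# One averaged vector per document -> a matrix k-means can cluster
doc_matrix = np.vstack([
    np.mean([word_vecs[w] for w in words], axis=0)
    for words in docs_top_words
])
print(doc_matrix.shape)  # (3, 2): one row per document
```

doc_matrix can then be passed to KMeans(n_clusters=...).fit(...) exactly like tfidf_matrix in the question, since it now has one row per document rather than one row per word.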

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0, 1]. However, it sometimes gives me an array whose first number is close to 2. Is this a bug or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example in calculating the area under the curve, or in plotting the false positive rate against the true positive rate.
Yet to plot what looks like a reasonable curve, one needs a threshold that no score reaches, so that zero instances are predicted positive and the curve starts at (0, 0). Since scikit-learn's ROC curve function does not require normalised probabilities as scores (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf would be sensible, but coders often expect finite data (and the implementation should also work for integer scores). Instead, the implementation uses max(score) + epsilon, where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's actually a problem.
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
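Whatever padding value is used (later scikit-learn releases may use np.inf rather than max + 1), the invariant is that the first threshold exceeds every score, so the curve starts at (0, 0). A quick check on the asker's own data:

```python
import numpy as np
from sklearn.metrics import roc_curve

np.random.seed(11)
aa = np.random.choice([True, False], 100)
bb = np.random.uniform(0, 1, 100)
fpr, tpr, thresholds = roc_curve(aa, bb)

# The first threshold is above every score, so no instance is
# predicted positive there and the curve begins at (0, 0)
print(thresholds[0] > bb.max())  # True
print(fpr[0], tpr[0])            # 0.0 0.0
```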
This seems like a bug to me: in roc_curve(aa, bb), 1 is added to the first threshold. You should create an issue at https://github.com/scikit-learn/scikit-learn/issues
