How does sklearn.linear_model.LinearRegression work with insufficient data? - scikit-learn

To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8-parameter model (7 slopes and 1 intercept).
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assignment to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method do they use to find the parameters of an underspecified model?

You don't need to reduce the degrees of freedom; it simply finds a solution to the least squares problem min sum_i (dot(beta,x_i)+beta_0-y_i)**2. For example, in the non-sparse case it uses the linalg.lstsq function from scipy. The default solver for this optimization problem is the gelsd LAPACK driver. If
A = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
is the augmented array with ones as its first column, then your solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually evaluate this formula; it uses the singular value decomposition of A instead.
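A minimal NumPy sketch of the minimum-norm idea, on synthetic data rather than the arrays from the question. It assumes, as I understand the estimator, that X and y are centered before the least squares solve when an intercept is fitted; the names Xc, yc and the random data are illustrative only. Note that in the underdetermined case the augmented-matrix formula and the centered solve can pick different members of the solution set, because the augmented form also counts the intercept in the norm being minimised.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 7))   # 6 samples, 7 features: underdetermined
y = rng.normal(size=6)

# Center the data so the intercept can be recovered afterwards
# (assumed preprocessing, see the note above).
X_mean, y_mean = X.mean(axis=0), y.mean()
Xc, yc = X - X_mean, y - y_mean

# lstsq returns the minimum-norm solution when the system is rank deficient.
coef, _, rank, _ = np.linalg.lstsq(Xc, yc, rcond=1e-10)
intercept = y_mean - X_mean @ coef

# The same coefficients via the Moore-Penrose pseudoinverse.
coef_pinv = np.linalg.pinv(Xc, rcond=1e-10) @ yc

print("rank of centered design:", rank)
print("fit reproduces y:", np.allclose(X @ coef + intercept, y))
print("lstsq matches pinv:", np.allclose(coef, coef_pinv))
```

With fewer samples than parameters there are infinitely many exact fits; the SVD-based solver simply returns the one with the smallest coefficient norm, so no explicit reduction of the degrees of freedom is involved.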

Related

detect highest peaks automatically from noisy data python

Is there any way to detect the highest peaks using a Python library without setting any parameters? I'm developing a user interface and I want the algorithm to be able to detect the highest peaks automatically.
I want it to be able to detect these peaks in the picture below:
graph here
Data looks like this:
8.60291e-07
-1.5491e-06
5.64568e-07
-9.51195e-07
1.07203e-06
4.6521e-07
6.43967e-07
-9.86092e-07
-9.82323e-07
6.38977e-07
-1.93884e-06
-2.98309e-08
1.33543e-06
1.05064e-06
1.17332e-06
-1.53549e-07
-8.9357e-07
1.59176e-06
-2.17331e-06
1.46756e-06
5.63301e-07
-8.77556e-07
7.47681e-09
-8.30101e-07
-3.6647e-07
5.27046e-07
-1.94983e-06
1.89018e-07
1.22533e-06
8.00735e-07
-8.51166e-07
1.13437e-06
-2.75787e-07
1.79601e-06
-1.67875e-06
1.13529e-06
-1.29865e-06
9.9688e-07
-9.34486e-07
8.89931e-07
-3.88634e-07
1.15124e-06
-4.23569e-07
-1.8029e-07
1.20537e-07
4.10736e-07
-9.99077e-07
-3.62984e-07
2.97916e-06
-1.95828e-06
-1.07398e-06
2.422e-06
-6.33202e-07
-1.36953e-06
1.6694e-06
-4.71764e-07
3.98849e-07
-1.0071e-06
-9.72984e-07
8.13553e-07
2.64193e-06
-3.12365e-06
1.34049e-06
-1.30419e-06
1.48369e-07
1.26033e-06
-2.59872e-07
4.28284e-07
-6.44356e-07
2.99934e-07
8.34335e-07
3.53226e-07
-7.08252e-07
4.1243e-07
2.41525e-06
-8.92159e-07
8.82339e-08
4.31945e-06
3.75152e-06
1.091e-06
3.8204e-06
-1.21356e-06
3.35564e-06
-1.06234e-06
-5.99808e-07
2.18155e-06
5.90652e-07
-1.36728e-06
-4.97017e-07
-7.77283e-08
8.68263e-07
4.37645e-07
-1.26514e-06
2.26413e-06
-8.52966e-07
-7.35596e-07
4.11911e-07
1.7585e-06
-inf
1.10779e-08
-1.49507e-06
9.87305e-07
-3.85296e-06
4.31265e-06
-9.89227e-07
-1.33537e-06
4.1713e-07
1.89362e-07
3.21968e-07
6.80237e-08
2.31636e-07
-2.98523e-07
7.99133e-07
7.36305e-07
6.39862e-07
-1.11932e-06
-1.57262e-06
1.86305e-06
-3.63716e-07
3.83865e-07
-5.23293e-07
1.31812e-06
-1.23608e-06
2.54684e-06
-3.99796e-06
2.90441e-06
-5.20203e-07
1.36295e-06
-1.89317e-06
1.22366e-06
-1.10373e-06
2.71276e-06
9.48181e-07
7.70881e-06
5.17066e-06
6.21254e-06
1.3513e-05
1.47878e-05
8.78543e-06
1.61819e-05
1.68438e-05
1.16082e-05
5.74059e-06
4.92458e-06
1.11884e-06
-1.07419e-06
-1.28517e-06
-2.70949e-06
1.65662e-06
1.42964e-06
3.40604e-06
-5.82825e-07
1.98288e-06
1.42819e-06
1.65517e-06
4.42749e-07
-1.95609e-06
-2.1756e-07
1.69164e-06
8.7204e-08
-5.35324e-07
7.43546e-07
-1.08687e-06
2.07289e-06
2.18529e-06
-2.8161e-06
1.88821e-06
4.07272e-07
1.063e-06
8.47244e-07
1.53879e-06
-9.0799e-07
-1.26709e-07
2.40044e-06
-9.48166e-07
1.41788e-06
3.67615e-07
-1.29199e-06
3.868e-06
9.54654e-06
2.51951e-05
2.2769e-05
7.21716e-06
1.36545e-06
-1.32681e-06
-3.09641e-06
4.90417e-07
2.99335e-06
1.578e-06
6.0025e-07
2.90656e-06
-2.08258e-06
-1.54214e-06
2.19757e-07
3.74982e-06
-1.76944e-06
2.15018e-06
-1.01935e-06
4.37469e-07
1.39078e-06
6.39587e-07
-1.7807e-06
-6.16455e-09
1.61557e-06
1.59644e-06
-2.35217e-06
5.29449e-07
1.9169e-06
-7.54822e-07
2.00342e-06
-3.28452e-06
3.91663e-06
1.66016e-08
-2.65897e-06
-1.4064e-06
4.67987e-07
1.67786e-06
4.69543e-07
-8.90106e-07
-1.4584e-06
1.37915e-06
1.98483e-06
-2.3735e-06
4.45618e-07
1.91504e-06
1.09653e-06
-8.00873e-07
1.32321e-06
2.04846e-06
-1.50656e-06
7.23816e-07
2.06049e-06
-2.43918e-06
1.64417e-06
2.65411e-07
-2.66107e-06
-8.01788e-07
2.05121e-06
-1.74988e-06
1.83594e-06
-8.14026e-07
-2.69342e-06
1.81152e-06
1.11664e-07
-4.21863e-06
-7.20551e-06
-5.92407e-07
-1.44629e-06
-2.08136e-06
2.86105e-06
3.77911e-06
-1.91898e-06
1.41742e-06
2.67914e-07
-8.55835e-07
-9.8584e-07
-2.74115e-06
3.39044e-06
1.39639e-06
-2.4964e-06
8.2486e-07
2.02432e-06
1.65793e-06
-1.43094e-06
-3.36807e-06
-8.96515e-07
5.31323e-06
-8.27209e-07
-1.39221e-06
-3.3754e-06
2.12372e-06
3.08218e-06
-1.42947e-06
-2.36777e-06
3.86218e-06
2.29327e-06
-3.3941e-06
-1.67291e-06
2.63828e-06
2.21008e-07
7.07794e-07
1.8172e-06
-2.00082e-06
1.80664e-06
6.69739e-07
-3.95395e-06
1.92148e-06
-1.07187e-06
-4.04938e-07
-1.76553e-06
2.7099e-06
1.30768e-06
1.41812e-06
-1.55518e-07
-3.78302e-06
4.00137e-06
-8.38623e-07
4.54651e-07
1.00027e-06
1.32196e-06
-2.62717e-06
1.67865e-06
-6.99249e-07
2.8837e-06
-1.00516e-06
-3.68011e-06
1.61847e-06
1.90887e-06
1.59641e-06
4.16779e-07
-1.35245e-06
1.65717e-06
-2.92667e-06
3.6203e-07
2.53528e-06
-2.0578e-07
-3.41919e-07
-1.42154e-06
-2.33322e-06
3.07175e-06
-2.69165e-08
-8.21045e-07
2.3175e-06
-7.22992e-07
1.49069e-06
8.75488e-07
-2.02676e-06
-2.81158e-07
3.6004e-06
-3.94708e-06
4.72983e-06
-1.38873e-06
-6.92139e-08
-1.4678e-06
1.04251e-06
-2.06625e-06
3.10406e-06
-8.13873e-07
7.23694e-07
-9.78912e-07
-8.65967e-07
7.37335e-07
1.52563e-06
-2.33591e-06
1.78265e-06
9.58435e-07
-5.22064e-07
-2.29736e-07
-4.26996e-06
-6.61411e-06
1.14789e-06
-4.32697e-06
-5.32779e-06
2.12241e-06
-1.40726e-06
1.76086e-07
-3.77194e-06
-2.71326e-06
-9.49402e-08
1.70807e-07
-2.495e-06
4.22324e-06
-3.62476e-06
-9.56055e-07
7.16583e-07
3.01447e-06
-1.41229e-06
-1.67694e-06
7.61627e-07
3.55881e-06
2.31015e-06
-9.50378e-07
4.45251e-08
-1.94791e-06
2.27081e-06
-3.34717e-06
3.05688e-06
4.57062e-07
3.87326e-06
-2.39215e-06
-3.52682e-06
-2.05212e-06
5.26495e-06
-3.28613e-07
-5.76569e-07
-7.46338e-07
5.98795e-06
8.80493e-07
-4.82965e-06
2.56839e-06
-1.58792e-06
-2.2294e-06
1.83841e-06
2.65482e-06
-3.10474e-06
-3.46741e-07
2.45557e-06
2.01328e-06
-3.92606e-06
inf
-8.11737e-07
5.72174e-07
1.57245e-06
8.02612e-09
-2.901e-06
1.22079e-06
-6.31714e-07
3.06241e-06
1.20059e-06
-1.80344e-06
4.90784e-07
3.74243e-06
-2.94342e-07
-3.45764e-08
-3.42099e-06
-1.43695e-06
5.91064e-07
3.47308e-06
3.78232e-06
4.01093e-07
-1.58435e-06
-3.47375e-06
1.34943e-06
1.11768e-06
1.95212e-06
-8.28033e-07
1.53705e-06
6.38031e-07
-1.84702e-06
1.34689e-06
-6.98669e-07
1.81653e-06
-2.42355e-06
-1.35257e-06
3.04367e-06
-1.21976e-06
1.61896e-06
-2.69528e-06
1.84601e-06
6.45447e-08
-4.94263e-07
3.47568e-06
-2.00531e-06
3.56693e-06
-3.19446e-06
2.72141e-06
-1.39059e-06
2.20032e-06
-1.76819e-06
2.32727e-07
-3.47382e-07
2.11823e-07
-5.22614e-07
2.69846e-06
-1.47983e-06
2.14554e-06
-6.27594e-07
-8.8501e-10
7.89124e-07
-2.8653e-07
8.30902e-07
-2.12857e-06
-1.90887e-07
1.07593e-06
1.40781e-06
2.41641e-06
-4.52689e-06
2.37207e-06
-2.19479e-06
1.65131e-06
1.2706e-06
-2.18387e-06
-1.72821e-07
5.41687e-07
7.2879e-07
7.56927e-07
1.57739e-06
-3.79395e-07
-1.02887e-06
-1.20987e-06
1.43066e-06
8.96301e-08
5.09766e-07
-2.8812e-06
-2.35944e-06
2.25912e-06
-2.78967e-06
-4.69913e-06
1.60822e-06
6.9342e-07
4.6225e-07
-1.33276e-06
-3.59033e-06
1.11206e-06
1.83521e-06
2.39163e-06
2.3468e-08
5.91431e-07
-8.80249e-07
-2.77405e-08
-1.13184e-06
-1.28036e-06
1.66229e-06
2.81784e-06
-2.97589e-06
8.73413e-08
1.06439e-06
2.39075e-06
-2.76974e-06
1.20862e-06
-5.12817e-07
-5.19104e-07
4.51324e-07
-4.7168e-07
2.35608e-06
5.46906e-07
-1.66748e-06
5.85236e-07
6.42944e-07
2.43164e-07
4.01031e-07
-1.93646e-06
2.07416e-06
-1.16116e-06
4.27155e-07
5.2951e-07
9.09149e-07
-8.71887e-08
-1.5564e-09
1.07266e-06
-9.49402e-08
2.04016e-06
-6.38123e-07
-1.94241e-06
-5.17294e-06
-2.18622e-06
-8.26703e-06
2.54364e-06
4.32614e-06
8.3847e-07
-2.85309e-06
2.72345e-06
-3.42752e-06
-1.36871e-07
2.23346e-06
5.26825e-07
1.3566e-06
-2.17111e-06
2.1463e-07
2.06479e-06
1.76929e-06
-1.2655e-06
-1.3797e-06
3.10706e-06
-4.72189e-06
4.38138e-06
6.41815e-07
-3.25623e-08
-4.93707e-06
5.05743e-06
5.17578e-07
-5.30524e-06
3.62463e-06
5.68909e-07
1.16226e-06
1.10843e-06
-5.00854e-07
9.48761e-07
-2.18701e-06
-3.57635e-07
4.26709e-06
-1.50836e-06
-5.84412e-06
3.5054e-06
3.94019e-06
-4.7623e-06
2.05856e-06
-2.22992e-07
1.64969e-06
2.64694e-06
-8.49487e-07
-3.63562e-06
1.0386e-06
1.69461e-06
-2.05798e-06
3.60349e-06
3.42651e-07
-1.46686e-06
1.19949e-06
-1.60519e-06
2.37793e-07
6.12366e-07
-1.54669e-06
1.43668e-06
1.87009e-06
-2.22626e-06
2.15155e-06
-3.10571e-06
2.05188e-06
-4.40002e-07
2.06683e-06
-1.11362e-06
5.96924e-07
-2.64471e-06
2.4892e-06
1.13083e-06
-3.23181e-07
5.10651e-07
2.73499e-07
-1.24899e-06
1.40564e-06
-9.3158e-07
1.45947e-06
3.70544e-07
-1.62628e-06
-1.70215e-06
1.72098e-06
8.19031e-07
-5.57709e-07
1.10107e-06
-2.81845e-06
1.57654e-07
3.30716e-06
-9.75403e-07
1.73126e-07
1.30447e-06
7.64771e-08
-6.65344e-07
-1.4346e-06
5.03171e-06
-2.84576e-06
2.3212e-06
-2.73373e-06
2.16675e-08
2.24026e-06
-4.11682e-08
-3.36642e-06
1.78775e-06
1.28174e-08
-9.32068e-07
2.97177e-06
-1.05338e-06
9.42505e-07
2.02362e-07
-1.81326e-06
2.16995e-06
2.83722e-07
-1.2648e-06
9.21814e-07
-8.9447e-07
-1.61597e-06
3.5036e-06
-6.79626e-08
1.52823e-06
-2.98682e-06
5.57404e-07
9.5166e-07
7.10419e-07
-1.28528e-06
-3.76038e-07
-1.03845e-06
2.96631e-06
-1.18356e-06
-2.77313e-07
3.24149e-06
-1.85455e-06
-1.27747e-07
3.6264e-07
4.66431e-07
-1.54443e-06
1.38437e-06
-1.53119e-06
7.4231e-07
-1.2388e-06
1.99774e-06
1.15799e-06
1.39478e-06
-2.93527e-06
-2.03012e-06
2.46667e-06
2.16751e-06
-2.50354e-06
3.95905e-07
5.74371e-07
1.33575e-07
-3.98315e-07
4.93927e-07
-5.23987e-07
-1.74713e-07
6.49384e-07
-7.16766e-07
2.35733e-06
-4.91333e-08
-1.88138e-06
1.74722e-06
4.03503e-07
3.5965e-07
1.44836e-07
The task you are describing could be treated like anomaly/outlier detection.
One possible solution is to use a Z-score transformation and treat every value with a z-score above a certain threshold as an outlier. Because there is no clear definition of an outlier, no method can detect such peaks without some parameter (here, the threshold).
One possible solution could be:
import numpy as np

def detect_outliers(data):
    outliers = []
    d_mean = np.mean(data)
    d_std = np.std(data)
    threshold = 3  # this defines what you would consider a peak (outlier)
    for point in data:
        z_score = (point - d_mean) / d_std
        if np.abs(z_score) > threshold:
            outliers.append(point)
    return outliers

# create normal data
data = np.random.normal(size=100)
# create outliers
outliers = np.random.normal(100, size=3)
# combine normal data and outliers
full_data = data.tolist() + outliers.tolist()
# print outliers
print(detect_outliers(full_data))
If you only want to detect peaks, remove the np.abs function call from the code.
This code snippet is based on a Medium Post, which also provides another way of detecting outliers.
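A vectorised variant of the same z-score idea, sketched under two assumptions that are not in the answer above: it returns positions rather than values, and it ignores non-finite entries, since the data in the question contains inf and -inf, which would otherwise turn the mean and standard deviation into nan. The function name detect_peaks_zscore and the synthetic test signal are illustrative only.

```python
import numpy as np

def detect_peaks_zscore(values, threshold=3.0):
    """Return indices whose z-score exceeds `threshold` (upward peaks only)."""
    values = np.asarray(values, dtype=float)
    # Keep track of where the finite samples sit in the original array.
    finite_idx = np.flatnonzero(np.isfinite(values))
    finite_vals = values[finite_idx]
    # Z-scores computed from the finite samples only.
    z = (finite_vals - finite_vals.mean()) / finite_vals.std()
    return finite_idx[z > threshold]

# Synthetic signal: small oscillation, one clear peak, two non-finite samples.
signal = 1e-6 * np.sin(np.arange(200))
signal[50] = 2e-5
signal[10] = -np.inf
signal[120] = np.inf

peaks = detect_peaks_zscore(signal)
print(peaks)
```

The threshold is still a parameter; the sketch only gives it a default, which is usually what a user interface needs.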

PACF function in statsmodels.tsa.stattools gives numbers greater than 1 when using ywunbiased?

I have a dataframe which is of length 177 and I want to calculate and plot the partial auto-correlation function (PACF).
I have the data imported etc and I do:
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import pacf
lags = 176
ys = pacf(data[key][array].diff(1).dropna(), alpha=0.05, nlags=lags, method="ywunbiased")
xs = range(lags+1)
plt.figure()
plt.scatter(xs,ys[0])
plt.grid()
plt.vlines(xs, 0, ys[0])
plt.plot(ys[1])
The method used results in numbers greater than 1 for very long lags (90ish), which is incorrect, and I get the warning
RuntimeWarning: invalid value encountered in sqrt
return rho, np.sqrt(sigmasq)
but since I can't see their source code I don't know what this means.
To be honest, when I search for PACF, all the examples only carry out PACF up to 40 lags or 60 or so and they never have any significant PACF after lag=2 and so I couldn't compare to other examples either.
But when I use:
method="ols"
# or
method="ywmle"
the numbers are corrected. So it must be the algo they use to solve it.
I tried importing inspect and using its getsource method, but it's useless: it just shows that it uses another package, and I can't find that.
If you also know where the problem arises from, I would really appreciate the help.
For your reference, the values for data[key][array] are:
[1131.130005, 1144.939941, 1126.209961, 1107.300049, 1120.680054, 1140.839966, 1101.719971, 1104.23999, 1114.579956, 1130.199951, 1173.819946, 1211.920044, 1181.27002, 1203.599976, 1180.589966, 1156.849976, 1191.5, 1191.329956, 1234.180054, 1220.329956, 1228.810059, 1207.01001, 1249.47998, 1248.290039, 1280.079956, 1280.660034, 1294.869995, 1310.609985, 1270.089966, 1270.199951, 1276.660034, 1303.819946, 1335.849976, 1377.939941, 1400.630005, 1418.300049, 1438.23999, 1406.819946, 1420.859985, 1482.369995, 1530.619995, 1503.349976, 1455.27002, 1473.98999, 1526.75, 1549.380005, 1481.140015, 1468.359985, 1378.550049, 1330.630005, 1322.699951, 1385.589966, 1400.380005, 1280.0, 1267.380005, 1282.829956, 1166.359985, 968.75, 896.23999, 903.25, 825.880005, 735.090027, 797.869995, 872.8099980000001, 919.1400150000001, 919.320007, 987.4799800000001, 1020.6199949999999, 1057.079956, 1036.189941, 1095.630005, 1115.099976, 1073.869995, 1104.48999, 1169.430054, 1186.689941, 1089.410034, 1030.709961, 1101.599976, 1049.329956, 1141.199951, 1183.26001, 1180.550049, 1257.640015, 1286.119995, 1327.219971, 1325.829956, 1363.609985, 1345.199951, 1320.640015, 1292.280029, 1218.890015, 1131.420044, 1253.300049, 1246.959961, 1257.599976, 1312.410034, 1365.680054, 1408.469971, 1397.910034, 1310.329956, 1362.160034, 1379.319946, 1406.579956, 1440.670044, 1412.160034, 1416.180054, 1426.189941, 1498.109985, 1514.680054, 1569.189941, 1597.569946, 1630.73999, 1606.280029, 1685.72998, 1632.969971, 1681.550049, 1756.540039, 1805.810059, 1848.359985, 1782.589966, 1859.449951, 1872.339966, 1883.949951, 1923.569946, 1960.22998, 1930.6700440000002, 2003.369995, 1972.290039, 2018.050049, 2067.560059, 2058.899902, 1994.9899899999998, 2104.5, 2067.889893, 2085.51001, 2107.389893, 2063.110107, 2103.840088, 1972.180054, 1920.030029, 2079.360107, 2080.409912, 2043.939941, 1940.2399899999998, 1932.22998, 2059.73999, 2065.300049, 2096.949951, 2098.860107, 2173.600098, 2170.949951, 2168.27002, 2126.149902, 
2198.810059, 2238.830078, 2278.8701170000004, 2363.639893, 2362.719971, 2384.199951, 2411.800049, 2423.409912, 2470.300049, 2471.649902, 2519.360107, 2575.26001, 2584.840088, 2673.610107, 2823.810059, 2713.830078, 2640.8701170000004, 2648.050049, 2705.27002, 2718.3701170000004, 2816.290039, 2901.52002, 2913.97998]
Your time series is pretty clearly not stationary, so the Yule-Walker assumptions are violated.
More generally, PACF is usually appropriate with stationary time series. You might difference your data first, before considering the partial autocorrelations.
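One mechanism that can push estimates outside [-1, 1], offered as a sketch rather than a diagnosis of the statsmodels code path: an "unbiased" (adjusted) autocovariance divides the lagged sum by n - k instead of n, so at long lags only a handful of products are averaged and the resulting autocorrelations are no longer guaranteed to stay within bounds. The helper sample_acf below is hypothetical and uses a deliberately non-stationary linear trend of the same length as the question's series.

```python
import numpy as np

def sample_acf(x, adjusted=False):
    """Sample autocorrelation at lags 0..n-1.

    adjusted=False divides every lagged sum by n (biased estimator);
    adjusted=True divides the lag-k sum by n - k.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    # Lagged sums of products for lags 0..n-1.
    sums = np.correlate(xc, xc, mode="full")[n - 1:]
    denom = (n - np.arange(n)) if adjusted else n
    acov = sums / denom
    return acov / acov[0]

x = np.arange(177.0)                      # pure trend, clearly non-stationary
acf_biased = sample_acf(x)
acf_adjusted = sample_acf(x, adjusted=True)

print("max |acf|, biased:  ", np.abs(acf_biased).max())
print("max |acf|, adjusted:", np.abs(acf_adjusted).max())
```

The biased estimate is bounded by 1 in absolute value, while the adjusted one is not, which is consistent with method="ols" or method="ywmle" behaving better at long lags.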

Calculate the average of Spearman correlation

I have 2 columns A and B which contain the Spearman's correlation values as follows:
0.127272727 -0.260606061
-0.090909091 -0.224242424
0.345454545 0.745454545
0.478787879 0.660606061
-0.345454545 -0.333333333
0.151515152 -0.127272727
0.478787879 0.660606061
-0.321212121 -0.284848485
0.284848485 0.515151515
0.36969697 -0.139393939
-0.284848485 0.272727273
How can I calculate the average of the correlation values in these 2 columns in Excel or Matlab? I found a close answer at this link: https://stats.stackexchange.com/questions/8019/averaging-correlation-values
The main point is that we cannot use the plain mean or average in this case, as explained in the link. They proposed a nice way to do it, but I don't know how to implement it in Excel or Matlab.
Following the second answer of the link you provided, which is the most general case, you can calculate the average Spearman's rho in Matlab as follows:
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
z = atanh(M);
meanRho = tanh(mean(z));
As you can see it gives mean values of
meanRho =
0.1165 0.1796
whereas the simple mean is quite close:
mean(M)
ans =
0.1085 0.1350
Edit: more information on Fisher's transformation here.
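For readers working outside MATLAB, the same Fisher z averaging can be sketched in Python with NumPy. This is a direct transcription of the atanh/mean/tanh steps above, using the values from the question; the variable names rho and mean_rho are my own.

```python
import numpy as np

# Columns A and B of Spearman correlations from the question.
rho = np.array([
    [ 0.127272727, -0.260606061],
    [-0.090909091, -0.224242424],
    [ 0.345454545,  0.745454545],
    [ 0.478787879,  0.660606061],
    [-0.345454545, -0.333333333],
    [ 0.151515152, -0.127272727],
    [ 0.478787879,  0.660606061],
    [-0.321212121, -0.284848485],
    [ 0.284848485,  0.515151515],
    [ 0.36969697,  -0.139393939],
    [-0.284848485,  0.272727273],
])

z = np.arctanh(rho)                  # Fisher z-transform of each correlation
mean_rho = np.tanh(z.mean(axis=0))   # average in z-space, then transform back

print(np.round(mean_rho, 4))
print(np.round(rho.mean(axis=0), 4))  # plain column means, for comparison
```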
In MATLAB, define a matrix with these values and use mean function as follows:
%define a matrix M
M = [0.127272727 -0.260606061;
-0.090909091 -0.224242424;
0.345454545 0.745454545;
0.478787879 0.660606061;
-0.345454545 -0.333333333;
0.151515152 -0.127272727;
0.478787879 0.660606061;
-0.321212121 -0.284848485;
0.284848485 0.515151515;
0.36969697 -0.139393939;
-0.284848485 0.272727273];
%calculates the mean of each column
meanVals = mean(M);
Result
meanVals =
0.1085 0.1350
It is also possible to calculate the total mean and the mean of each row as follows:
meanVals = mean(M(:)); %total mean
meanVals = mean(M,2); %mean of each row

Random Forest feature importance: how many are actually used?

I use RF twice in a row.
First, I fit it using max_features='auto' and the whole dataset (109 features), in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; it correctly gives me one score for each of the 109 features:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then, I transform the dataset using the previous fit, which should reduce its dimensionality, and then I re-fit RF on it.
Given max_features='auto' and the 109 features, I would expect to have ~10 features in total; instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data.
It is the threshold parameter of the transform method that determines which features are kept.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
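The rule quoted above can be checked against the numbers in the question with plain NumPy. This sketch assumes the default "mean" threshold was in effect for the transform; it does not call scikit-learn, it only applies the "greater or equal are kept" rule to the 109 importances posted from the first fit. By my count the rule keeps 62 features, the same number of importances the second fit returned.

```python
import numpy as np

# Importances copied from the first fit in the question (109 values).
importances = np.array([
    0.00118087, 0.01268531, 0.0017589, 0.01614814, 0.01105567,
    0.0146838, 0.0187875, 0.0190427, 0.01429976, 0.01311706,
    0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
    0.01271825, 0.0095337, 0.00985686, 0.00952823, 0.01165877,
    0.00193286, 0.0012602, 0.00208145, 0.00203459, 0.00229907,
    0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
    0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
    0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
    0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
    0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
    0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
    0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
    0.02038318, 0.0202834, 0.01290935, 0.01476593, 0.0108275,
    0.0118773, 0.01050919, 0.0111477, 0.00684507, 0.01170021,
    0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
    0.00065709, 0.0, 0.00246064, 0.00217982, 0.00305187,
    0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
    0.0176546, 0.01443356, 0.01834896, 0.01385694, 0.01320648,
    0.00966011, 0.0148321, 0.01574166, 0.0167107, 0.00791634,
    0.01121442, 0.02171706, 0.01855552, 0.0257449, 0.02925843,
    0.01789742, 0.0, 0.0, 0.00379275, 0.0024365,
    0.00333905, 0.00238971, 0.00068355, 0.00075399,
])

threshold = importances.mean()    # what threshold="mean" evaluates to
kept = importances >= threshold   # features at or above the threshold are kept

print("features in:", importances.size)
print("features kept:", int(kept.sum()))
```

To end up with roughly 10 features, the threshold itself would have to be raised; max_features plays no part in this selection.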

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers that are in [0,1]. However, it sometimes gives me an array with the first number close to "2". Is it a bug, or did I do something wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.
Yet to plot what looks like a reasonable curve, one needs to have a threshold that incorporates 0 data points. Since Scikit-Learn's ROC curve function need not have normalised probabilities for thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf is sensible but coders often expect finite data (and it's possible the implementation also works for integer thresholds). Instead the implementation uses max(score) + epsilon where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
From the documentation:
thresholds : array, shape = [n_thresholds]
Decreasing thresholds on the decision function used to compute
fpr and tpr. thresholds[0] represents no instances being predicted
and is arbitrarily set to max(y_score) + 1.
So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.
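The documented behaviour can be mimicked in a few lines of NumPy. This is a simplified sketch, not scikit-learn's implementation: the helper roc_thresholds_sketch is hypothetical, it keeps every distinct score, and it follows the max(y_score) + 1 convention quoted above (I believe more recent releases report np.inf in this position instead, but treat that as an assumption).

```python
import numpy as np

def roc_thresholds_sketch(y_score):
    """Distinct scores in decreasing order, with max(y_score) + 1 prepended."""
    distinct = np.unique(y_score)[::-1]          # np.unique sorts ascending
    return np.concatenate(([distinct[0] + 1], distinct))

scores = np.array([0.1, 0.4, 0.35, 0.8])
thresholds = roc_thresholds_sketch(scores)
print(thresholds)

# At the extra first threshold nothing is predicted positive,
# which gives the (fpr, tpr) = (0, 0) starting point of the curve.
n_predicted_positive = int((scores >= thresholds[0]).sum())
print(n_predicted_positive)
```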
This seems like a bug to me: in roc_curve(aa,bb), 1 is added to the first threshold. You should create an issue here: https://github.com/scikit-learn/scikit-learn/issues
