Understanding the results from sklearn logistic regression + recursive feature elimination - scikit-learn

I am training a Logistic Regression classifier in sklearn and using RFECV to reduce the number of features. I have 10,000 data items with 3000 features each. Using RFECV, I get 106 features. The code for this is shown below:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV

clf = LogisticRegression(solver='lbfgs', max_iter=10000)
rfecv = RFECV(clf, step=0.1, verbose=True)   # drop 10% of the features per iteration
rfecv = rfecv.fit(X_train, y_train)
X_train = X_train[:, rfecv.support_]         # keep only the selected features
clf.fit(X_train, y_train)
And my stats (accuracy, precision, recall and F1 score) all improve (a bit) with 106 vs. 3000 features. However, I did another test to see if I could zero out some of these 106 coefficients. So I just set all the coefficients whose absolute value is below a certain threshold to 0 and I see this:
As I increase the threshold, I do zero out more weights (the percentage of zeroed weights is shown by the diamond points). The maximum absolute coefficient value is 1.62, so a threshold of about 1.7 zeroes 100% of the coefficients, which is why the 1.6 and 1.7 points make sense.
But the stats stay pretty steady as I zero out more weights, until a threshold of about 1.2, by which point I am already zeroing out 93% of the coefficients. I thought I would see a more gradual decrease in the stats from 0 to 1.6, but instead there is a sharp change only around 1.2-1.3.
So am I doing something wrong with sklearn and how I use RFECV? Or is there something about logistic regression that I'm not understanding? Or is it just that, for this dataset, I can predict the class just as well with 3000, 100, or just 5 features?
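For reference, here is a minimal sketch of the thresholding experiment described above; it assumes a hypothetical held-out split X_test/y_test that has already been reduced to the RFECV-selected columns:
import copy
import numpy as np
from sklearn.metrics import f1_score

original_coefs = clf.coef_.copy()
for threshold in np.arange(0.0, 1.8, 0.1):
    clf_t = copy.deepcopy(clf)
    # zero out coefficients whose absolute value is below the threshold
    clf_t.coef_ = np.where(np.abs(original_coefs) < threshold, 0.0, original_coefs)
    zeroed = (clf_t.coef_ == 0).mean()
    print(threshold, zeroed, f1_score(y_test, clf_t.predict(X_test)))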

Related

How does model.evaluate in Keras work and how to recreate it manually?

How does Keras' model.evaluate() work? In particular, how does the batch_size argument affect the calculation?
The documentation says the loss/metrics are calculated as averages over the batches. However, since there is only one scalar output per loss/metric (which should represent the overall average over all data, i.e. over all batch averages), the result should not depend on the choice of batch_size (at least as long as the number of samples is divisible by the batch_size).
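For concreteness, here is a minimal sketch of that averaging argument (this is not Keras internals, just the arithmetic): when every batch has the same size, the unweighted mean of the per-batch mean losses equals the mean of the per-sample losses.
import numpy as np

per_sample_losses = np.random.rand(60)                          # stand-in for individual losses
batch_means = per_sample_losses.reshape(-1, 10).mean(axis=1)    # batch_size = 10
print(batch_means.mean(), per_sample_losses.mean())             # identical values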
But after building and training a network (consisting of Conv2D, Conv2DTranspose, MaxPooling2D, and BatchNormalization, using ReLU as activations), I tried evaluating it on my test set of 60 samples:
Evaluation with batch_size = 60 gave loss 0.1375531554222107
Evaluation with batch_size = 10 gave loss 0.1381820539633433
Evaluation with batch_size = 3 gave loss 0.14014312624931335
Evaluation with batch_size = 1 gave loss 0.15437299211819966
The entire dataset (60 samples) was divisible by all of the batch_sizes used here (1, 3, 10, 60), yet the outputs vary noticeably. It could be due to batch normalization, but I don't think so, because if I run the evaluation multiple times the numbers are always the same.
Even if there were no shuffling before the batches are defined, and the numbers were therefore expected to always be the same, that still doesn't explain why I can't reproduce the last number (the evaluation with batch_size=1) by averaging individual per-sample evaluations. That is, why do the following two snippets produce very different results:
model.evaluate(testset, testlabels, batch_size=1)
and:
# keep the batch dimension when slicing out single samples
losses = [model.evaluate(testset[i:i+1], testlabels[i:i+1], batch_size=1) for i in range(60)]
np.mean(losses)
Assume, of course, that model.evaluate returns only one scalar (the final loss) and no metrics or intermediate losses. In summary: how does keras.evaluate work internally, how are the results for different batch_sizes related, and what is their relationship to simply averaging per-sample losses?
This similar question doesn't help (it asks about the optimal choice of batch size, while I just want to know how evaluation works).

Why are my r^2 values so consistently negative?

I'm not sure if the problem is with my regression estimator models or with my understanding of what the r^2 measure of fit actually means. I am working on a project using scikit-learn and ~11 different regression estimators to produce (rough!) predictions of baseball fantasy performance. Certain models always fare better than others (Decision Tree Regression and Extra Trees Regression produce the worst r^2 scores, while ElasticNetCV and LassoCV produce the best r^2 scores, and every once in a while one of those might even be slightly positive!).
If a horizontal line produces an r^2 score of 0, then even if all my models were worthless, had literally zero predictive value, and were spitting out numbers entirely at random, shouldn't I still get small positive r^2 values sometimes, from sheer dumb luck alone? Yet 8 of the 11 estimators I use, despite running over different datasets hundreds of times, have never once produced even a tiny positive r^2.
Am I misunderstanding how r^2 works?
I am not switching the argument order in sklearn's .score function either; I have double-checked this many times. When I do pass y_pred and y_true in the wrong order, it yields hugely negative r^2 values (below -50).
The fact that that's the case actually adds to my confusion about how r^2 is a measure of fit here, but I digress...
## I don't know whether I'm supposed to include my df4 or even a
## sample, but suffice it to say, here is a single row to show what
## kind of data we have. It is all normalized and/or z-scored.
"""
>> print(df4.head(1))
HomeAway ParkFactor Salary HandedVs Hand oppoBullpen \
Points
3.0 1.0 -1.229 -0.122111 1.0 0.0 -0.90331
RibRunHistory BibTibHistory GrabBagHistory oppoTotesRank \
Points
3.0 0.964943 0.806874 -0.224993 -0.846859
oppoSwipesRank oppoWalksRank Temp Precip WindSpeed \
Points
3.0 -1.40371 -1.159115 -0.665324 -0.380048 -0.365671
WindDirection oppoPositFantasy oppoFantasy
Points
3.0 0.229944 -1.011505 0.919269
"""
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score

def ElasticNetValidation(df4):
    # features are the normalized columns, the target ('Points') is the index
    X = df4.values
    y = df4.index
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

    # fit on the training split, score on the held-out 10%
    ENTrain = ElasticNetCV(cv=20)
    ENTrain.fit(X_train, y_train)
    y_pred = ENTrain.predict(X_test)

    # separate model fit on all data, evaluated with 20-fold cross-validation
    EN = ElasticNetCV(cv=20)
    ENModel = EN.fit(X, y)
    print('ElasticNet R^2: ' + str(r2_score(y_test, y_pred)))
    scores = cross_val_score(ENModel, X, y, cv=20)
    print("ElasticNet Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
    return ENModel
When I run this estimator, along with the ten other regression estimators I have been experimenting with, both r2_score() and cross_val_score().mean() show negative numbers nearly every time. Certain estimators ALWAYS produce negative scores that are not even close to zero (decision tree regressor, extra trees regressor). Certain estimators fare better and sometimes even produce a tiny positive score, never more than 0.01 though, and even those estimators (ElasticNetCV, LassoCV, LinearRegression) are negative most of the time, albeit only slightly.
Even if the models I'm building are horrible, say they are totally random and have no predictive power whatsoever with respect to the target: shouldn't they predict better than a plain horizontal line about as often as not? How is it that an unrelated model predicts worse than a horizontal line so consistently?
You most likely have issues with overfitting. As you correctly noted, negative R2 values can occur if your model performs worse than simply fitting an intercept term. Your models probably do not capture any 'real' underlying dependence but merely fit random noise. You are calculating the R2 score on a small test set, and it is entirely possible that this fitting of noise yields consistently worse results on the test set than a simple intercept term would.
This is a typical case of bias-variance tradeoff. Your models have low bias and high variance and therefore perform poorly on the test data. There are certain models that aim at reducing overfit / variance, for example the Lasso and Elastic Net. These models actually are among the models that you see performing better.
In order to convince yourself that sklearn's r2_score function works properly and to get familiar with it, I would recommend that you first fit and predict your model on the training data only (leaving out the CV as well). R2 can never be negative in this case. Also make sure that your models include an intercept term (wherever available).
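A minimal sketch of that sanity check, assuming df4 from the question (ElasticNet fits an intercept by default; this is an illustration, not the asker's exact pipeline):
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score

X, y = df4.values, df4.index.values
model = ElasticNet()                    # fit_intercept=True by default, no CV
model.fit(X, y)
# in-sample R^2: score on the same data the model was fit on
print(r2_score(y, model.predict(X)))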

Model evaluation: model.score vs. ROC curve (AUC indicator)

I want to evaluate a logistic regression model (binary event) using two measures:
1. model.score and the confusion matrix, which give me 81% classification accuracy
2. the ROC curve (AUC), which gives back a value of 50%
Are these two results in contradiction? Is that possible? I'm probably missing something, but I still can't find it.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

y_pred = log_model.predict(X_test)
accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
y_test.count()
print(cm)
# roc_curve returns (fpr, tpr, thresholds), in that order
fpr, tpr, _ = roc_curve(y_test, y_pred, drop_intermediate=False)
roc = roc_auc_score(y_test, y_pred)
The accuracy score is calculated under the assumption that a class is predicted when its probability exceeds 50%. This means you are looking at only one case (one working point) out of many. Say you would like to classify an instance as '0' even if it has a probability greater than 30% (this may happen if one of your classes is more important to you and its a-priori probability is very low). In that case you would get a very different confusion matrix and a different accuracy ([TP+TN]/[ALL]).

The ROC AUC score examines all of these working points and gives you an estimate of the overall model. A score of 50% means the model is no better than randomly selecting classes according to their a-priori probabilities. You would like the ROC AUC to be much higher before claiming that you have a good model.
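To make the working-point idea concrete, here is a hedged sketch (variable names assumed from the question's code) that moves the decision threshold from 0.5 to 0.3 and recomputes the confusion matrix and accuracy:
from sklearn.metrics import accuracy_score, confusion_matrix

y_prob = log_model.predict_proba(X_test)[:, 1]   # probability of class 1
y_pred_03 = (y_prob >= 0.3).astype(int)          # a different working point
print(confusion_matrix(y_test, y_pred_03))
print(accuracy_score(y_test, y_pred_03))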
So in the above case you can say that your model does not have good predictive strength. As a matter of fact, a better predictor would be to simply predict everything as "1"; in your case that would lead to an accuracy above 99%.

Recursive feature elimination with CV doesn't reduce feature count

I have a protein dataset on which I need to perform RFE. There are 100 examples with binary class labels (sick = 1, healthy = 0) and 9847 features for each example. To reduce the dimensionality I am performing RFECV with a LogisticRegression estimator and 5-fold CV. This is the code:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

model = LogisticRegression()
rfecv = RFECV(estimator=model, step=1, cv=StratifiedKFold(5), n_jobs=-1)
rfecv.fit(X_train, y_train)
print("Number of features selected: %d" % rfecv.n_features_)
Number of features selected: 9847
I then plot the number of features vs the CV scores:
import matplotlib.pyplot as plt

plt.figure()
plt.xlabel("feature count")
plt.ylabel("CV accuracy")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()
What I think is happening (and this is what I need an expert for) is that the first peak shows the optimal number of features. After that the curve drops and only starts to climb again because of overfitting, not really separating classes but individual examples. Could this be the case? And if so, how can I obtain those features (i.e. the ones at that first peak), given that rfecv.support_ only gives me the ones where the highest accuracy was reached (meaning: all of them)?
And while I am at it: how would I choose the best estimator for the RFE? Is it just trial and error, going through all possible classifiers, or is there any logic to why I would use logistic regression over, say, a linear SVC?
One approach I use for feature relevance is RandomForest or Extremely Randomized Trees (Extra-Trees).
With RFECV you can use:
rfecv.n_features_
to see how many features were selected, and:
rfecv.ranking_
to see each feature's rank (selected features get rank 1). Another algorithm you can use to reduce the dimensionality of your dataset is PCA.
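As a hedged illustration of the tree-based feature-relevance idea (ExtraTreesClassifier as one concrete choice; X_train and y_train as in the question):
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
# feature indices, most important first
ranking = np.argsort(forest.feature_importances_)[::-1]
print(ranking[:20])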

Multivariate LSTM Forecast Loss and evaluation

I have a CNN-RNN model architecture with bidirectional LSTMs for a time series regression problem. My loss does not converge over 50 epochs. Each epoch has 20k samples. The loss keeps bouncing between 0.001 and 0.01.
batch_size=1
epochs = 50
model.compile(loss='mean_squared_error', optimizer='adam')
trainingHistory = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size, shuffle=False)
I tried to train the model with incorrectly paired X and Y data, for which the loss stays around 0.5. Is it a reasonable conclusion that my X and Y have a non-linear relationship which can be learned by my model over more epochs?
The predictions of my model capture the pattern, but with an offset. I use dynamic time warping distance to manually check the accuracy of the predictions; is there a better way?
Model:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=128, dropout=0.05, recurrent_dropout=0.35, return_sequences=True,
               batch_input_shape=(batch_size, featureSteps, input_dim)))
model.add(LSTM(units=32, dropout=0.05, recurrent_dropout=0.35, return_sequences=False))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
If you tested with:
Wrong data: loss ~0.5
Correct data: loss ~0.01
Then your model is actually capable of learning something.
There are a few possibilities here:
Your output data does not fit the range of the last layer's activation.
Your model reached a limit for the current learning rate (the gradient update steps are too big and can't improve the model anymore).
Your model is not good enough for the task.
Your data has some degree of inherent randomness.
Case 1:
Make sure your Y is within the range of your last activation function.
For a tanh (the LSTM default), all Y data should be between -1 and +1.
For a sigmoid, between 0 and 1.
For a softmax, between 0 and 1, but make sure your last dimension is not 1, otherwise all results will be 1, always.
For a relu, between 0 and infinity.
For linear, any value.
Convergence goes better if you have a limited activation instead of one that goes to infinity.
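As a minimal sketch of Case 1 for a regression target (featureSteps and input_dim are assumed from the question's code): a linear output layer avoids squashing the predictions into a range the targets don't live in.
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(units=64, return_sequences=False,
               input_shape=(featureSteps, input_dim)))
model.add(Dense(units=1, activation='linear'))   # unbounded output for raw Y values
model.compile(loss='mse', optimizer='adam')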
If the output range is already fine and training has simply stalled, the more likely culprit is the learning rate (see Case 2 below).
Case 2:
If data is ok, try decreasing the learning rate after your model stagnates.
The default learning rate for Adam in Keras is 0.001; when training stagnates, it is common to divide it by 10 (or more):
from keras.optimizers import Adam

# after training enough with the default value:
model.compile(loss='mse', optimizer=Adam(lr=0.00001))
trainingHistory2 = model.fit(.........)

# you can do this again if you notice that the loss decreased and then stopped again:
model.compile(loss='mse', optimizer=Adam(lr=0.000001))
If the problem was the learning rate, this will let your model learn more than it already did (there might be some difficulty at the beginning, until the optimizer adjusts itself).
Case 3:
If you've had no success so far, maybe it's time to increase the model's capacity.
Maybe add more units to the layers, add more layers or even change the model.
Case 4:
There's probably nothing you can do about this...
But if you increased the model like in case 3, be careful with overfitting (keep some test data to compare the test loss versus the training loss).
Models that are too good can simply memorize your data instead of learning important insights about it.
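A small sketch of that overfitting check (trainX, trainY, and the compile settings are assumed from the question): hold out part of the data and compare the training and validation losses.
history = model.fit(trainX, trainY, epochs=epochs, batch_size=batch_size,
                    shuffle=False, validation_split=0.2)
print(history.history['loss'][-1], history.history['val_loss'][-1])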
