I am building an ML project on bank default classification and getting a low accuracy score of 56% and precision of 61%. I know this can be improved further. Please let me know some factors that could help improve the score.
The data is about loan repayment and I have to classify the customers as Defaulters or Non-Defaulters. My train-test ratio is 70%:30%.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = loan.drop("loan_status", axis=1)
Y = loan["loan_status"]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.30, random_state=6)
loan_model = LogisticRegression()
loan_model.fit(Xtrain, Ytrain)
Prediction = loan_model.predict(Xtest)
Classification Report:

              precision    recall  f1-score   support

           0       0.41      0.22      0.29        59
           1       0.61      0.79      0.69        91

   micro avg       0.57      0.57      0.57       150
   macro avg       0.51      0.51      0.49       150
weighted avg       0.53      0.57      0.53       150

Confusion matrix:

[[13 46]
 [19 72]]

Accuracy Score: 0.5666666666666667
Without seeing your data or code, we can only give you general advice on improving your accuracy.
The following are the two main ones that come to mind:
1- Low accuracy on a classification problem usually means your classes are not well separated by the features you currently have. The remedy for this is to find more (and better) features.
2- If you have enough observations, try models with more complex decision boundaries, such as an SVM or a neural network with more layers and neurons (see the sketch below).
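For example, here is a minimal sketch of trying an SVM on the same split (it assumes the Xtrain, Xtest, Ytrain, Ytest variables from the question above; the pipeline and scaler are just one reasonable setup, since SVMs generally need scaled features):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Scale the features, then fit an RBF-kernel SVM on the same train/test split.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
svm_model.fit(Xtrain, Ytrain)
print(classification_report(Ytest, svm_model.predict(Xtest)))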
I am getting very high RMSE and MAE (30,000+) for MLPRegressor, ForestRegression and linear regression with only the input variables scaled; however, when I scale the target values as well I get an RMSE of 0.2. I would like to know if that is an acceptable thing to do.
Secondly, is it normal to have better R-squared values for the test set (i.e. 0.98 for test and 0.85 for train)?
Thank you
Answering your first question: I think you are being misled by the performance measures you have chosen to evaluate your model with. Both RMSE and MAE are sensitive to the range in which the target variable is measured; if you scale down your target variable, the RMSE and MAE values will certainly go down too. Let's take an example to illustrate that.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
I have written two functions for computing both RMSE and MAE. Now let's plug in some values and see what happens:
y_true = np.array([2,5,9,7,10,-5,-2,2])
y_pred = np.array([3,4,7,9,8,-3,-2,1])
For the time being, let's assume that the true and predicted values are as shown above. Now we are ready to compute RMSE and MAE for this data.
rmse(y_true,y_pred)
1.541103500742244
mae(y_true, y_pred)
1.375
Now let's scale down our target variable by a factor of 10 and compute the same measures again.
y_scaled_true = np.array([2,5,9,7,10,-5,-2,2])/10
y_scaled_pred = np.array([3,4,7,9,8,-3,-2,1])/10
rmse(y_scaled_true,y_scaled_pred)
0.15411035007422444
mae(y_scaled_true,y_scaled_pred)
0.1375
We can now see very clearly that just by scaling our target variable, the RMSE and MAE scores have dropped, creating the illusion that the model has improved, when in fact it has NOT. As soon as we scale the model's predictions back up, we are in exactly the same state as before.
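As a quick check (reusing the arrays and functions defined above), scaling the predictions back up by the same factor of 10 returns the errors to their original magnitude:

rmse(y_scaled_true * 10, y_scaled_pred * 10)   # back to roughly 1.5411
mae(y_scaled_true * 10, y_scaled_pred * 10)    # back to roughly 1.375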
So, coming to the point: MAPE (Mean Absolute Percentage Error) can be a better way to measure the performance of your model, because it is insensitive to the scale in which the variables are measured. If we compute MAPE for both sets of values, we see that they are the same:
def mape(y_true, y_pred):
    # Note: undefined if y_true contains zeros.
    return np.mean(np.abs((y_true - y_pred) / y_true))
mape(y_true,y_pred)
0.28849206349206347
mape(y_scaled_true,y_scaled_pred)
0.2884920634920635
So it is better to rely on MAPE over MAE or RMSE if you want your performance measure to be independent of the scale in which the variables are measured.
Answering your second question: since you are dealing with complicated models like MLPRegressor and ForestRegression, which have hyper-parameters that need to be tuned to avoid overfitting, the best way to find good hyper-parameter values is to divide the data into train, validation and test sets and use techniques like K-Fold cross-validation to find the optimal settings. It is quite difficult to say whether the above values are acceptable or not just by looking at this one case.
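A minimal sketch of such a K-Fold evaluation with scikit-learn (assuming X and y are your preprocessed feature matrix and target; the model and number of folds are just examples):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor

# 5-fold cross-validation; the scorer returns negative RMSE, so flip the sign.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0), X, y,
                         cv=cv, scoring="neg_root_mean_squared_error")
print(-scores.mean())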
It is actually a common practice to scale target values in many cases.
For example, a highly skewed target may give better results if a log or log1p transform is applied to it. I don't know the characteristics of your data, but there could be a transformation that decreases your RMSE.
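For instance, a minimal sketch of a log1p target transform (assuming a non-negative target and illustrative X_train / y_train / X_test names):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Fit on the log-transformed target, then map predictions back with expm1
# before computing RMSE/MAE on the original scale.
model = RandomForestRegressor(random_state=0)
model.fit(X_train, np.log1p(y_train))
y_pred = np.expm1(model.predict(X_test))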
Secondly, the test set is meant to be a sample of unseen data, used to give a final estimate of your model's performance. Once you look at that data and tune the model to perform better on it, it effectively becomes a cross-validation set.
You should split your data into three parts: train, cross-validation and test sets. Train on the training data, tune the parameters according to performance on the cross-validation set, and only after you are done tuning run the model on the test set to get an estimate of how it performs on unseen data; report that as the accuracy of your model.
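One simple way to get such a three-way split with scikit-learn is to call train_test_split twice (a sketch; the 60/20/20 proportions are only an example):

from sklearn.model_selection import train_test_split

# Hold out 20% as the final test set, then split the rest into train and cross-validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% leaves a 60/20/20 train/cross-validation/test split.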
I've set up my first scikit-learn example to play with, and I'm trying to gauge the accuracy of my predictions. I've got my training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels, 95% of the labels are zeros, and the model is simply guessing 0 every time and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?
You have a clear class imbalance issue. Your classifier is predicting 0 all the time because it knows it will be right 95% of the time. You can inspect this by calling predict(X_test) on your fitted classifier; if all the predicted values are 0, you know this is the case.
To get a better idea of how the model performs, you can upsample the data labelled 1 or downsample the data labelled 0. You can use this package, which builds on scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit-learn's resampling method, which will bootstrap new data points for you.
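For instance, assuming the scikit-learn utility meant here is sklearn.utils.resample and that the data sits in a DataFrame df with a 0/1 column named label (both names are illustrative), upsampling could look like this:

import pandas as pd
from sklearn.utils import resample

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Bootstrap the minority class up to the size of the majority class.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=0)
balanced = pd.concat([majority, minority_upsampled])

Note that the resampling should only be applied to the training data; evaluate on an untouched test set so the scores still reflect the real class distribution.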
In the scikit-learn version of Multinomial Naive Bayes there is a parameter fit_prior.
I have found that for unbalanced datasets, setting this to False is usually desirable.
For my particular use case, setting it to False raised my AUC from 0.52 to 0.61.
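For reference, the scikit-learn setting being described is just (a minimal sketch):

from sklearn.naive_bayes import MultinomialNB

# fit_prior=False uses a uniform class prior instead of learning the
# class frequencies from the training data.
clf = MultinomialNB(fit_prior=False)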
However, in pyspark.ml.classification.NaiveBayes there is no such setting, which I think means it fits the priors regardless.
I "think" I can counteract this with the thresholds param, to essentially give more weight to the minority class.
In my case the breakdown is 87% negative and 13% positive.
If I can indeed use the thresholds to in effect set fit_prior to false, what value should I use?
Would it be 13/87 ~ 0.15, or ...?
i.e. I would create it with NaiveBayes(thresholds=[1, .15])
Or am I completely off base with this?
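For what it's worth, the proposed construction would look like this in PySpark (the threshold values are the guess from above, not a verified answer; PySpark predicts the class with the largest p/threshold, so a smaller threshold for the positive class boosts it):

from pyspark.ml.classification import NaiveBayes

# thresholds=[negative, positive]; values would still need to be tuned on a validation set.
nb = NaiveBayes(modelType="multinomial", thresholds=[1.0, 0.15])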
Basically, I had to build an application for classifying documents based on the part of speech of the words in their vocabulary. The algorithm used for learning the classification problem was ready-made and handed over to me.
Based on the examples I got, I need to interpret these results (precision, recall, accuracy). Can someone give an opinion on whether these results are good or not?
accuracy = 0.91 (true positive + true negative)/all
f-measure = 0.34
precision = 0.45
recall = 0.33
negative rate = 0.92
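For context, here is a small sketch of how such numbers relate to the counts in a confusion matrix (standard definitions; the tp/fp/fn/tn names are illustrative, and "negative rate" is read here as the true negative rate, which is an assumption):

def summarize(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # (true positive + true negative) / all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    true_negative_rate = tn / (tn + fp)
    return accuracy, precision, recall, f_measure, true_negative_rate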