Machine Learning - Train a model using imbalanced data - scikit-learn

I have two classes in my data.
This is what the class distribution looks like:
0.0    169072
1.0     84944
In other words, I have a 2:1 class distribution.
I believe I have two choices: downsample class 0.0 or upsample class 1.0. If I go with option 1, I'm losing data. If I go with option 2, I'm using non-real (synthetic) data.
Is there a way I can train the model without upsampling or downsampling?
This is what my classification_report looks like:
              precision    recall  f1-score   support
         0.0       0.68      1.00      0.81     51683
         1.0       1.00      0.00      0.00     24522
    accuracy                           0.68     76205
   macro avg       0.84      0.50      0.40     76205
weighted avg       0.78      0.68      0.55     76205

Your data is slightly imbalanced, yes, but that does not mean you have only one of two options (under- or over-sampling your data). You can leave the data as is and apply cost-sensitive training in your model. For example, since in your case the classes have a 2:1 ratio, you can give a weight of 2 to your minority class. In an XGBoost classifier, this argument is called scale_pos_weight. See more in this excellent tutorial.
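For illustration, a minimal sketch of cost-sensitive training using scikit-learn's class_weight argument (the toy dataset generated below merely stands in for the real data, which is not shown in the question):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with a roughly 2:1 class ratio, standing in for the real dataset
X, y = make_classification(n_samples=3000, weights=[0.67, 0.33], random_state=0)

# Cost-sensitive training: weight the minority class about 2x instead of resampling
clf = LogisticRegression(class_weight={0: 1, 1: 2})  # or class_weight="balanced"
clf.fit(X, y)

# If you use XGBoost, the equivalent knob is scale_pos_weight:
# from xgboost import XGBClassifier
# clf = XGBClassifier(scale_pos_weight=169072 / 84944)  # roughly 2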
Regarding model evaluation, use a classification report to get a full picture of your model's true and false predictions (precision and recall are your two best friends in this process!).
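A minimal sketch of producing that report, reusing the clf, X and y from the sketch above (in practice you would of course evaluate on a held-out test set):
from sklearn.metrics import classification_report

print(classification_report(y, clf.predict(X)))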

I would not recommend either approach.
I'm thinking about models to detect fraud. By definition, fraud should be a small percentage of outcomes - on the order of 1-5%. Changing the percentage for training would be a gross distortion of the problem being solved.
Better to leave the proportions as they are.
Make sure that your train, validation, and test data sets all have ratios that reflect the real problem.
Adjust your success metric instead. Don't go for accuracy. A naive model that always predicts the 0 outcome will be correct about two-thirds of the time. You want your model to be better than that, or better than a weighted coin flip.
I'd recommend using recall as your criterion for success.
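A minimal sketch of both points (keeping the real class ratio in every split via stratify, and scoring on recall); the toy data below only stands in for the real problem:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.67, 0.33], random_state=0)

# stratify=y keeps the original 2:1 class ratio in both the train and the test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Recall on the positive class: the share of true 1 cases the model actually catches
print(recall_score(y_test, clf.predict(X_test)))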

Related

What could be some reasons for a low accuracy score in a Logistic Regression model?

I am building an ML project on bank default classification and getting a low accuracy score of 56% and precision of 61%. I know this can be improved further. Please let me know some factors that could improve the score.
The data is about loan repayment, and I have to classify the customers as Defaulters or Non-Defaulters. My train-test ratio is 70%:30%.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = loan.drop("loan_status", axis=1)
Y = loan["loan_status"]
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.30, random_state=6)
loan_model = LogisticRegression()
loan_model.fit(Xtrain, Ytrain)  # fit before predicting
Prediction = loan_model.predict(Xtest)
Classification Report:
              precision    recall  f1-score   support
           0       0.41      0.22      0.29        59
           1       0.61      0.79      0.69        91
   micro avg       0.57      0.57      0.57       150
   macro avg       0.51      0.51      0.49       150
weighted avg       0.53      0.57      0.53       150
Confusion matrix:
[[13 46]
[19 72]]
Accuracy Score: 0.5666666666666667
Without seeing your data or code, we can only give you general advice on improving your accuracy.
The following are the two main points that come to mind:
1. Low accuracy on a classification task means your classes are not well separable with the current features. The remedy is to find more (and better) features.
2. If you have enough observations, try models with more complex decision boundaries, such as an SVM or a neural network with more layers and neurons (see the sketch below).
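A minimal sketch of point 2, assuming the same loan DataFrame and train/test split as in the question (purely illustrative, not tuned):
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# An RBF-kernel SVM can fit decision boundaries that logistic regression cannot;
# scaling the features first matters a great deal for SVMs.
svm_model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm_model.fit(Xtrain, Ytrain)
print(classification_report(Ytest, svm_model.predict(Xtest)))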

Is it acceptable to scale target values for regressors?

I am getting very high RMSE and MAE (30,000+) for MLPRegressor, random forest regression and linear regression with only the input variables scaled; however, when I scale the target values as well I get an RMSE of 0.2. I would like to know if that is an acceptable thing to do.
Secondly, is it normal to have a better R-squared value for test than for train (i.e. 0.98 for test and 0.85 for train)?
Thank You
Answering your first question: I think you are being misled by the performance measures you have chosen to evaluate your model with. Both RMSE and MAE are sensitive to the scale on which you measure your target variable. If you scale down your target variable, the values of RMSE and MAE will certainly go down. Let's take an example to illustrate that.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean(np.square(y_true - y_pred)))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
I have written two functions for computing both RMSE and MAE. Now let's plug in some values and see what happens:
y_true = np.array([2,5,9,7,10,-5,-2,2])
y_pred = np.array([3,4,7,9,8,-3,-2,1])
For the time being, let's assume that the true and the predicted values are as shown above. Now we are ready to compute RMSE and MAE for this data.
rmse(y_true,y_pred)
1.541103500742244
mae(y_true, y_pred)
1.375
Now let's scale down our target variable by a factor of 10 and compute the same measure again.
y_scaled_true = np.array([2,5,9,7,10,-5,-2,2])/10
y_scaled_pred = np.array([3,4,7,9,8,-3,-2,1])/10
rmse(y_scaled_true,y_scaled_pred)
0.15411035007422444
mae(y_scaled_true,y_scaled_pred)
0.1375
We can now clearly see that just by scaling our target variable, our RMSE and MAE scores have dropped, creating the illusion that our model has improved, when in fact it has not. When we scale our model's predictions back up, we are in exactly the same state.
So, coming to the point: MAPE (Mean Absolute Percentage Error) could be a better way to measure the performance of your model, as it is insensitive to the scale on which the variables are measured. If you compute MAPE for both sets of values, you will see that they are the same:
def mape(y, y_pred):
    return np.mean(np.abs((y - y_pred) / y))
mape(y_true,y_pred)
0.28849206349206347
mape(y_scaled_true,y_scaled_pred)
0.2884920634920635
So it is better to rely on MAPE rather than MAE or RMSE if you want your performance measure to be independent of the scale on which the variables are measured.
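For what it is worth, recent versions of scikit-learn (0.24+) ship this metric, so you do not have to hand-roll it; a quick check on the arrays above:
from sklearn.metrics import mean_absolute_percentage_error

print(mean_absolute_percentage_error(y_true, y_pred))                # about 0.2885
print(mean_absolute_percentage_error(y_scaled_true, y_scaled_pred))  # same value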
Answering your second question: since you are dealing with complicated models like MLPRegressor and random forest regression, which have hyper-parameters that need to be tuned to avoid overfitting, the best way to find good hyper-parameter values is to split the data into train, validation and test sets and use techniques like K-fold cross-validation to find the optimal settings. It is quite difficult to say whether the values above are acceptable just by looking at this one case.
It is actually a common practice to scale target values in many cases.
For example, a highly skewed target may give better results if a log or log1p transform is applied to it. I don't know the characteristics of your data, but there could be a transformation that decreases your RMSE.
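A minimal sketch of that idea using scikit-learn's TransformedTargetRegressor, which fits on the transformed target and maps predictions back automatically (the toy regression data below is only a stand-in):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
y = np.abs(y)  # log1p needs non-negative targets

# The regressor is trained on log1p(y); predictions are inverted with expm1,
# so errors are still reported on the original scale of the target.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)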
Secondly, the test set is meant to be a sample of unseen data that gives a final estimate of your model's performance. Once you look at that data and tune the model to perform better on it, it effectively becomes a cross-validation set.
You should split your data into three parts: train, cross-validation and test sets. Train on the training data, tune parameters according to performance on the cross-validation set, and only after you are done tuning, run the model on the test set to estimate how it works on unseen data and report that as the accuracy of your model.
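A minimal sketch of such a three-way split using two calls to train_test_split (reusing the X and y from the sketch above; the 60/20/20 proportions are only an example):
from sklearn.model_selection import train_test_split

# First carve off 20% as the final test set, then split the rest into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% gives a 60/20/20 split overall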

How do I test my classifier accuracy against random values?

I've set up my first scikit-learn example to play with and I'm trying to gauge accuracy on my predictions. I've got training and test lists set up fine, but I'm getting ~0.95 accuracy even if I give it random values.
This looks to be because I'm checking for 0/1 labels, and 95% of the labels are zeros, so it's always guessing 0 and getting 0.95 accuracy (I think?). Obviously this isn't what I want.
How do I go about deciding if my classifiers are working, and how do I get meaningful accuracy values?
You have a clear class imbalance issue. Your classifier is predicting 0 all the time, knowing it will be right 95% of the time. You can inspect this by calling predict(X_test) on your fitted classifier: if all the returned values are 0, you know this is the case.
To get a better idea of how the model performs, you can upsample the data labelled 1 or downsample the data labelled 0. You can use this package, which builds on scikit-learn and implements a number of resampling methods. Alternatively, you can use scikit-learn's resample utility, which will bootstrap new data points for you (see the sketch below).
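A minimal sketch of upsampling the minority class with scikit-learn's resample utility (the toy arrays below stand in for the real data):
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data: 95% zeros, 5% ones
X = np.random.rand(1000, 4)
y = np.array([0] * 950 + [1] * 50)

X_min, y_min = X[y == 1], y[y == 1]

# Bootstrap the minority class up to the size of the majority class
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=950, random_state=0)

X_balanced = np.vstack([X[y == 0], X_min_up])
y_balanced = np.concatenate([y[y == 0], y_min_up])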

Is there a way to do fit_prior=False with pyspark.ml.classification.NaiveBayes

In the scikit-learn version of Multinomial Naive Bayes there is a fit_prior parameter.
I have found that for unbalanced datasets usually setting this to false is desired.
For my particular use case setting it raised my AUC from 0.52 to 0.61.
However in the pyspark.ml.classification.NaiveBayes there is no such setting which I think means it is fitting priors regardless.
I "think" I can counteract this with the thresholds param to essentially give more weight to the minority class.
In my case the breakdown is 87% negative and 13% positive.
If I can indeed use thresholds to, in effect, set fit_prior to false, what value should I use?
Would it be 13/87 ≈ 0.15, or ....?
i.e. I would create it with NaiveBayes(thresholds=[1, .15])
Or am I completely off base with this?
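For concreteness, a minimal sketch of the idea described above (purely illustrative; whether this really reproduces fit_prior=False is exactly what is being asked, and the 0.15 value is the questioner's own guess):
from pyspark.ml.classification import NaiveBayes

# A lower threshold for class 1 makes the minority class easier to predict;
# Spark predicts the class with the largest p/t, so thresholds effectively reweight the classes.
nb = NaiveBayes(modelType="multinomial", thresholds=[1.0, 0.15])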

Text classification precision and recall

Basically I had to build an application for classifying documents based on the part of speech of the words in the vocabulary. The algorithm used for learning the classification problem was ready-made and handed over to me.
Based on the examples I got, I need to interpret these results (precision, recall, accuracy). Can someone give an opinion on whether these results are good or not?
accuracy = 0.91 (true positive + true negative)/all
f-measure = 0.34
precision = 0.45
recall = 0.33
negative rate = 0.92
