I have limited knowledge about sample_weight in the sklearn library, but from what I gather, it's generally used to help balance imbalanced datasets during training. What I'm wondering is: if I already have a perfectly balanced binary classification dataset (i.e. equal numbers of 1's and 0's in the label/Y/class column), could one add a sample weight to the 0's in order to put more importance on predicting the 1's correctly?
For example, let's say I really want my model to predict 1's well, and it's ok if it predicts 0's for samples that turn out to be 1's. Would setting a sample_weight of 2 for the 0's and 1 for the 1's be the correct thing to do here in order to put more importance on correctly predicting the 1's? Or does that even matter? And then I guess during training, is the F1 scoring function generally accepted as the best metric to use?
Thanks for the input!
ANSWER
After a couple of rounds of testing and more research, I've discovered that yes, it does make sense to add more weight to the 0's with a balanced binary classification dataset if your goal is to decrease the chance of over-predicting the 1's. I ran two separate training sessions, once using a weight of 2 for the 0's and 1 for the 1's, and then vice versa, and found that my model predicted fewer 1's when the extra weight was applied to the 0's, which was my ultimate goal.
In case that helps anyone.
Also, I'm using sklearn's balanced accuracy scoring function for those tests, which averages the recall obtained on each class.
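For reference, here is a minimal sketch of the kind of test I ran (the data, weights and model below are just placeholders, not my actual setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Toy balanced binary dataset standing in for the real data.
X, y = make_classification(n_samples=2000, weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weight the 0's twice as heavily as the 1's, so mistakes on 0's
# (i.e. predicting them as 1) are penalized more during fitting.
sample_weight = np.where(y_train == 0, 2.0, 1.0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train, sample_weight=sample_weight)

pred = clf.predict(X_test)
print("balanced accuracy:", balanced_accuracy_score(y_test, pred))
print("number of predicted 1's:", pred.sum())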
Related
I am currently fitting a neural network to predict a continuous target ranging from 1 to 10. However, the samples are not evenly distributed over the data set: samples with a target in the 1-3 range are quite underrepresented (they only account for around 5% of the data). At the same time, they are of great interest, since the low range of the target is the critical range.
Is there any way to know how my model predicts these low range samples in particular? I know that when doing multiclass classification I can examine the recall to get a taste of how well the model performs on a certain class. For classification use cases I can also set the class weight parameter in Keras to account for class imbalances, but this is obviously not possible for regression.
Until now I have used typical metrics like MAE, MSE and RMSE and get satisfying results. I would, however, like to know how the model performs on the "critical" samples.
From my point of view, I would compute the test metrics (classification performance, MSE, RMSE) on the whole test set, i.e. over the whole range of target values (1-10). Then, of course, I would do the same separately for the specific range you consider critical (say 1-3) and compare how far the two sets of results diverge. You can even run a statistical test for the significance of the difference between the two populations (Wilcoxon test, etc.).
Maybe this link could be useful for your comparisons. Since you are doing regression, you can even compare MSE and RMSE.
What you need to do is find identifiers for these critical samples; oftentimes row indices are used for this. Once you have predicted all of your samples, use those stored indices to pick the critical samples out of your predictions and run whatever metric you like over just those filtered samples. I hope this answers your question.
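For instance, a rough sketch of that filtering step (the arrays below are made-up placeholders, assuming the true targets and the predictions are aligned by row index):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder arrays standing in for the real test targets and predictions,
# aligned by position so the row index acts as the identifier.
rng = np.random.default_rng(0)
y_test = rng.uniform(1, 10, size=500)
y_pred = y_test + rng.normal(0, 0.8, size=500)

# Boolean mask selecting the "critical" low-range samples (targets in 1-3).
critical = y_test <= 3

print("overall MAE:  ", mean_absolute_error(y_test, y_pred))
print("critical MAE: ", mean_absolute_error(y_test[critical], y_pred[critical]))
print("overall RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
print("critical RMSE:", np.sqrt(mean_squared_error(y_test[critical], y_pred[critical])))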
I am working on a time-series prediction problem using GradientBoostingRegressor, and I think I'm seeing significant overfitting, as evidenced by a significantly better RMSE for training than for prediction. In order to examine this, I'm trying to use sklearn.model_selection.cross_validate, but I'm having problems understanding the result.
First: I was calculating RMSE by fitting to all my training data, then "predicting" the training data outputs using the fitted model and comparing those with the training outputs (the same ones I used for fitting). The RMSE that I observe is of the same order of magnitude as the predicted values and, more importantly, it's in the same ballpark as the RMSE I get when I submit my predicted results to Kaggle (although the training RMSE is lower, reflecting overfitting).
Second, I use the same training data, but apply sklearn.model_selection.cross_validate as follows:
cross_validate( predictor, features, targets, cv = 5, scoring = "neg_mean_squared_error" )
I figure neg_mean_squared_error should be the negative of the square of my RMSE. Accounting for that, I still find that the error reported by cross_validate is one or two orders of magnitude smaller than the RMSE I was calculating as described above.
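For concreteness, this is roughly the comparison I'm doing (simplified, with placeholder data standing in for my real features and targets):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Placeholder data standing in for my real training set.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 10))
targets = features[:, 0] * 3 + rng.normal(size=500)

predictor = GradientBoostingRegressor(max_depth=3)
cv_results = cross_validate(predictor, features, targets,
                            cv=5, scoring="neg_mean_squared_error")

# test_score holds the *negated* MSE for each fold; flip the sign and take
# the square root to make it comparable to the training RMSE I computed.
fold_rmse = np.sqrt(-cv_results["test_score"])
print("per-fold CV RMSE:", fold_rmse)
print("mean CV RMSE:", fold_rmse.mean())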
In addition, when I modify my GradientBoostingRegressor max_depth from 3 to 2, which I would expect to reduce overfitting and thus improve the CV error, I find that the opposite is the case.
I'm keen to use cross-validation so I don't have to validate my hyperparameter choices by using up Kaggle submissions, but given what I've observed, it's not clear to me that the results will be understandable or useful.
Can someone explain how I should be using Cross Validation to get meaningful results?
I think there is a conceptual problem here.
If you want to compute the error of a prediction you should not use the training data. As the name says, these data are used only for training; to evaluate accuracy scores you have to use data that the model has never seen.
As for cross-validation, it's an approach for evaluating the model on several different training/testing splits. The process is as follows: you divide your data into n groups and iterate, each time changing which group is used for testing. If you have n groups you will do n iterations, and each time the training and testing sets will be different.
Basically, what you should do is something like this:
Train the model using months from 0 to 30 (for example)
See the predictions made with months from 31 to 35 as input.
If the input has to be the same length, divide the features in half (that should be about 17 months each).
I hope I understood the question correctly; otherwise, leave a comment.
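If it helps, here is a minimal sketch of that idea using sklearn's TimeSeriesSplit (the data and the month boundaries below are only placeholders, not your actual setup):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Placeholder time-ordered data, one row per month.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=36)

# Each split trains on an initial stretch of the series and validates on the
# block that immediately follows it, so the model never sees "future" months.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    model = GradientBoostingRegressor(max_depth=2)
    model.fit(X[train_idx], y[train_idx])
    rmse = np.sqrt(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    print(f"train months {train_idx.min()}-{train_idx.max()}, "
          f"test months {test_idx.min()}-{test_idx.max()}, RMSE={rmse:.3f}")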
I am trying to build a model on a class-imbalanced dataset (binary: 25% 1's and 75% 0's). I have tried classification algorithms and ensemble techniques. I am a bit confused about the points below, as I am mainly interested in predicting the 1's.
1. Should I give preference to Sensitivity or Positive Predictive Value? Some ensemble techniques give at most 45% sensitivity and a low Positive Predictive Value, while others give 62% Positive Predictive Value and low sensitivity.
2. My dataset has around 450K observations and 250 features. After a power test I took 10K observations by simple random sampling. When selecting important variables using ensemble techniques, the features are different from the ones I get when I try with 150K observations. Based on my intuition and domain knowledge, I feel the features that came up as important in the 150K-observation sample are more relevant. What is the best practice?
3. Last, can I use the variable importance generated by RF in other ensemble techniques to predict the accuracy?
Can you please help me out, as I am a bit confused about which w
The preference between Sensitivity and Positive Predictive Value depends on the ultimate goal of your analysis. The difference between these two values is nicely explained here: https://onlinecourses.science.psu.edu/stat507/node/71/
Altogether, these are two measures that look at the results from two different perspectives. Sensitivity gives you the probability that the test will find the "condition" among those who actually have it. Positive Predictive Value gives you the probability that the "condition" is actually present among those who test positive.
Accuracy depends on the outcome of your classification: it is defined as (true positives + true negatives) / total, not on the variable importances generated by RF.
Also, it is possible to compensate for the imbalances in the dataset, see https://stats.stackexchange.com/questions/264798/random-forest-unbalanced-dataset-for-training-test
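As a small illustration, sensitivity and Positive Predictive Value correspond to recall and precision in sklearn, and class_weight is one way to compensate for the imbalance (the data and model below are only placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 75% 0's and 25% 1's.
X, y = make_classification(n_samples=5000, weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" upweights the minority class during training.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

print(confusion_matrix(y_test, pred))
print("sensitivity (recall for class 1):", recall_score(y_test, pred))
print("positive predictive value (precision for class 1):", precision_score(y_test, pred))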
I am using Spark ML to optimise a Naive Bayes multi-class classifier.
I have about 300 categories and I am classifying text documents.
The training set is balanced enough and there are about 300 training examples for each category.
All looks good and the classifier works with acceptable precision on unseen documents. But what I am noticing is that when classifying a new document, very often the classifier assigns a high probability to one of the categories (the predicted probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).
What are the possible reasons for this?
I would like to add that in Spark ML there is something called the "raw prediction", and when I look at it I can see negative numbers, but they have more or less comparable magnitudes, so even the category with the high probability has a comparable raw prediction score. I am having difficulty interpreting these scores.
Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document and the x_i are its features, Naive Bayes returns:

argmax_{c in C} P(c | d) = argmax_{c in C} P(d | c) * P(c) / P(d)

Since P(d) is the same for all classes we can simplify this to

argmax_{c in C} P(d | c) * P(c)

where

P(d | c) = P(x_1, ..., x_n | c)

Since we assume that the features are conditionally independent given the class (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:

argmax_{c in C} P(c) * prod_i P(x_i | c)

The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid that we use the following property:

log(a * b) = log(a) + log(b)

and replace the initial expression with:

argmax_{c in C} ( log P(c) + sum_i log P(x_i | c) )
These are the values you get as the raw prediction. Since each element is negative (the logarithm of a value in (0, 1]), the whole expression has a negative value as well. As you discovered yourself, these values are further normalized so that the maximum value is equal to 1 and then divided by the sum of the normalized values.
It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly (ignoring possible numerical issues) the same. If no other class gets a prediction close to one, it means that, given the evidence, it is a very strong prediction, so it is actually something you want to see.
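To give a feel for those numbers, here is a small numpy sketch (with made-up raw scores; this is the generic log-sum-exp conversion, not necessarily Spark's exact normalization) showing how negative log-domain scores map to probabilities while preserving their order:

import numpy as np
from scipy.special import logsumexp

# Made-up raw scores: log P(c) + sum_i log P(x_i | c) for three classes.
raw = np.array([-250.3, -265.1, -262.8])

# Convert log-joint scores to posterior probabilities in a numerically
# stable way by subtracting the log of the summed exponentials.
prob = np.exp(raw - logsumexp(raw))

print(prob)        # one value close to 1, the others close to 0
print(prob.sum())  # 1.0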
I am working myself through a book on machine learning right now.
Working on a NaiveBayesClassifier the author is very much in favour of the cross-validation method.
He proposes to split the data into ten buckets (files) and train on nine of them each time withholding a different bucket.
Up to now the only approach I am familiar with is to split the data into a training set and a test set in the ratio of 50%/50% and simply train the classifier all at once.
Could someone explain what are possible advantages of using cross-validation?
Cross-validation is a way to address the tradeoff between bias and variance.
When you fit a model on a training set, your goal is to minimize the training error. You can always drive it down by adding more terms, higher-order polynomials, etc., but doing so increases the variance of the model.
But your true objective is to predict outcomes for points that your model has never seen. That's what holding out the test set simulates.
You'll create your model on a training set, then try it out on a test set. You will find there is a sweet spot in the bias-variance tradeoff that gives the best results; the simplest model that achieves it should be your choice.
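As a quick toy illustration of that tradeoff (a sketch, not taken from either book): fitting polynomials of increasing degree keeps driving the training error down, while the test error eventually stops improving and tends to get worse.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data: a simple stand-in for "points the model has never seen".
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")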
I'd recommend "An Introduction to Statistical Learning" or "The Elements of Statistical Learning" by Hastie, Tibshirani and co-authors for more details.
A general principle in machine learning is that the more data you have for training, the better the results you get. This is important to state before I start answering the question.
Cross-validation helps us to avoid overfitting and also helps to improve generalization accuracy, i.e. the accuracy of the model on future unseen points. Now, if you only divide your dataset into dtrain and dtest, there is a problem: whenever some part of the model has to be chosen using both training and testing data, you can no longer claim that the accuracy on future unseen points will match the accuracy you got on your test data. Take k-NN as an example: the nearest neighbours are found using the training data, but the value of k would end up being tuned on the test data.
But if you use CV, then k can be determined from the CV data, and your test data can be treated as truly unseen data points.
Now suppose you divide your dataset into three parts: Dtrain (60%), Dcv (20%) and Dtest (20%). You now have only 60% of the data to train with. If you instead want to use the full 80% (everything except Dtest) for training, you can do so with m-fold cross-validation. In m-fold CV you divide your data into two parts, Dtrain and Dtest (say 80% and 20%).
Say the value of m is 4. You divide the training data into 4 equal parts at random (d1, d2, d3, d4). First train the model on d1, d2, d3 with d4 as the CV fold and compute the accuracy; next train on d2, d3, d4 with d1 as the CV fold; and continue until each of the 4 parts has been used once as the CV fold. This way you use your entire 80% (Dtrain), and Dtest can still be treated as a future unseen dataset.
The advantages are better use of your Dtrain data, reduced overfitting, and a more reliable estimate of generalization accuracy.
The downside is the higher time complexity.
In your case the value of m is 10.
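A small sketch of that procedure with sklearn (the k-NN model and the synthetic data are just placeholders): hold out Dtest first, then run 10-fold cross-validation on the rest to pick the hyperparameter.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out Dtest (20%) as the "future unseen" data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 10-fold CV on the remaining 80% to choose k for k-NN.
best_k, best_score = None, -np.inf
for k in (1, 3, 5, 7, 9, 15):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_train, y_train, cv=10).mean()
    if score > best_score:
        best_k, best_score = k, score

# Only now touch Dtest, with the chosen k.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("chosen k:", best_k, "test accuracy:", final.score(X_test, y_test))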
Hope this helps.
The idea is to have the maximum number of points to train the model, to achieve accurate results. Every data point chosen for the training set is excluded from the test set. So we use the concept of k and k-1: first we divide the data set into k equally sized bins, then we take one bin as the test set and let the remaining k-1 bins form the training set. We repeat the process until every bin has been used once as the test set (k), with the remaining bins as training (k-1). Doing this, no data point is missed out for training purposes.