Setting feature weights for KNN - scikit-learn

I am working with sklearn's implementation of KNN. While my input data has about 20 features, I believe some of the features are more important than others. Is there a way to:
1. set a weight for each feature when "training" the KNN learner;
2. learn the optimal weight values, with or without pre-processing the data?
On a related note, I understand that KNN generally does not require training, but since sklearn implements it using KD-trees, the tree must be generated from the training data. However, this sounds like it's turning KNN into a binary-tree problem. Is that the case?
Thanks.

kNN is simply based on a distance function. Saying "feature two is more important than the others" usually means that a difference in feature two is worth, say, 10x a difference in the other coordinates. A simple way to achieve this is to multiply coordinate #2 by its weight, so you put into the tree not the original coordinates but the coordinates multiplied by their respective weights.
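A minimal sketch of that idea on a synthetic dataset (the weight values here are hand-picked for illustration, not learned):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Illustrative weights: feature #2 counts 10x in the distance.
    w = np.ones(X.shape[1])
    w[1] = 10.0

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train * w, y_train)         # the tree is built on weighted coords
    print(knn.score(X_test * w, y_test))  # apply the same weights at query time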
If your important directions are combinations of the coordinates, you might need to apply an appropriate matrix transform to your coordinates before applying the weights; see PCA (principal component analysis). PCA is also likely to help you with question 2.

The answer to question two is called "metric learning", and it is currently not implemented in scikit-learn. Using the popular Mahalanobis distance (with a diagonal covariance matrix) amounts to rescaling the data with StandardScaler. Ideally you would want your metric to take the labels into account.
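A minimal sketch of that rescaling, again on synthetic data; standardizing each feature before KNN is equivalent to a Mahalanobis distance with a diagonal covariance:

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # The pipeline standardizes at fit time and applies the exact same
    # transform to every query point.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    knn.fit(X, y)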

Related

How can I specify confidence in training data?

I am classifying data with categorical variables, where the information has been provided by people.
My training dataset is of varying quality: for some of the data I have a higher confidence that people provided correct information, whereas for the rest I am not so sure.
How can I pass this information into a classification algorithm such as Naive Bayes or K nearest neighbour?
Or should I instead look to another algorithm?
I think what you want to do is provide an individual weight (for the importance/confidence) of each data point you have.
For instance, if you are very certain that one data point is of higher quality and should carry more weight than others you are less confident in, you can specify that when fitting your classifier.
Sklearn's Gaussian Naive Bayes classifier (GaussianNB), for instance, supports this: you can pass sample_weight when calling its fit() method.
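A minimal sketch, with made-up toy data and confidence values:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
    y = np.array([0, 0, 1, 1])
    confidence = np.array([1.0, 0.4, 0.9, 0.7])  # higher = more trusted row

    clf = GaussianNB()
    clf.fit(X, y, sample_weight=confidence)      # low-confidence rows count less
    print(clf.predict([[1.2, 1.9]]))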

Why does more features in a random forest decrease accuracy dramatically?

I am using sklearn's random forest module to predict values based on 50 different dimensions. When I increase the number of dimensions to 150, the accuracy of the model decreases dramatically. I would expect more data to only make the model more accurate, but adding features seems to make it less accurate.
I suspect that splitting might only be done on one dimension at a time, which would mean that the features that actually matter get less attention when the trees are built. Could this be the reason?
Yes — the additional features you added might not have much predictive power, and since a random forest considers a random subset of features at each split, the original 50 informative features can get crowded out. To test this hypothesis, plot the variable importances using sklearn, as in the sketch below.
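A minimal sketch of that check, on a synthetic regression problem mirroring the question (150 features, only 50 informative):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=300, n_features=150, n_informative=50,
                           random_state=0)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

    # If the hypothesis holds, the noise features should cluster near zero.
    order = np.argsort(rf.feature_importances_)[::-1]
    print(order[:10])
    print(rf.feature_importances_[order[:10]])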
Your model is overfitting the data.
From Wikipedia:
An overfitted model is a statistical model that contains more parameters than can be justified by the data.
[Image: https://qph.fs.quoracdn.net/main-qimg-412c8556aacf7e25b86bba63e9e67ac6-c — decision boundaries of a simple vs. an overly complex classifier]
There are plenty of illustrations of overfitting, but this 2D plot, for instance, shows the functions that would be learned for a binary classification task. Because the function on the right has too many parameters, it learns spurious patterns in the data that don't generalize.

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my imbalanced labeled datasets. It seems that class_weight doesn't exist as a parameter for this classifier in sklearn. I see that I can pass sample_weight to fit, but I cannot use it when dealing with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weight for GB in sklearn.
Don't confuse this with sample_weight.
Sample weights change the loss function and the score that you're trying to optimize. They are often used with survey data, where the sampling approach has gaps.
Class weights are used to correct class imbalances, as a proxy for over-/under-sampling. There is no direct way to do that for GB in sklearn (you can do it in random forests, though), but you can emulate it with per-sample weights, as sketched below.
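A hedged workaround, not an official sklearn feature: derive per-sample weights that mimic class_weight='balanced' and let GridSearchCV forward them to each fold's fit (the parameter grid here is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.utils.class_weight import compute_sample_weight

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Weights inversely proportional to class frequency.
    sw = compute_sample_weight(class_weight="balanced", y=y)

    grid = GridSearchCV(GradientBoostingClassifier(random_state=0),
                        param_grid={"max_depth": [2, 3]}, cv=3)
    grid.fit(X, y, sample_weight=sw)  # forwarded, and sliced, per CV fold
    print(grid.best_params_)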
Very late, but I hope this can be useful to other members.
In Zichen Wang's article on towardsdatascience.com, point 5, Gradient Boosting, says:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced datasets.
And a chart shows that half of the gradient boosting models have an AUROC over 80%. So, considering how GB models perform and how they are built, it does not seem necessary to introduce a class_weight parameter of the kind RandomForestClassifier has in the sklearn package.
In the book Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido (2017 edition), page 89, Chapter 2, Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now, if you still have scoring problems due to imbalanced class proportions in the target variable, consider whether your data should be split so that different models can be applied to the subsets, because the data may not be as homogeneous as it seems. I mean there may be a variable missing from your training set (a hidden variable, clearly) that strongly influences the results; in that case even the best GB model will struggle to score well, because it lacks a crucial piece of information that, for many reasons, you cannot make appear in the matrix.
Some updates:
I found, by chance, that some libraries implement this as a parameter of their gradient boosting objects. This is the case for H2O, where the balance_classes parameter is documented as:
Balance training data class counts via over/under-sampling (for imbalanced data).
Type: bool (default: False).
If you want to stay with sklearn, you should do as HakunaMaData said: over-/under-sample, because that is ultimately what other libraries do when such a parameter exists.
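A minimal sketch of oversampling the minority class with sklearn.utils.resample, on synthetic imbalanced data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.utils import resample

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    X_maj, X_min = X[y == 0], X[y == 1]

    # Oversample the minority class until the class counts match.
    X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
    X_bal = np.vstack([X_maj, X_up])
    y_bal = np.r_[np.zeros(len(X_maj)), np.ones(len(X_up))]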

sklearn: AUC score for LinearSVC and OneSVM

One option of the SVM classifier (SVC) is probability, which is False by default. The documentation does not say exactly what it does; looking at the libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC or OneSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using decision_function(X) as the thresholds?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No: you need some kind of score in your model whose threshold you can change, and the ROC curve is drawn by varying that threshold. The point of the ROC curve is, of course, to see how well your model orders the observations.
In the case of SVM, there are two ways I see people drawing ROC curves for them:
using the distance to the decision boundary, as I mentioned in my own question;
using the bias term as the threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True), probabilities will be calculated for you in this manner, using CV, which you can then use to draw the ROC curve. But as mentioned in the link, it is much faster to draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 when using a linear kernel, as in my own case, because varying the bias varies the distance in that particular case.
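A minimal sketch of approach #1 on synthetic data: rank the test points by their signed distance to the hyperplane and let roc_auc_score sweep the threshold:

    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    clf = LinearSVC().fit(X_tr, y_tr)
    scores = clf.decision_function(X_te)  # signed distances, not probabilities
    print(roc_auc_score(y_te, scores))    # AUC only needs the ranking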
One way to calculate AUC with sklearn is through a predict_proba method on your classifier; this is what the probability parameter on SVC enables (you are correct that it's calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
That said, AUC depends only on how the scores rank the observations, so you can also pass the decision function output directly to roc_auc_score; any monotonic rescaling to the range [0, 1] would leave the AUC unchanged. Just keep in mind that decision values are not calibrated probabilities, so thresholds chosen on them are not comparable with thresholds chosen on probabilities.

How to consistently standardize sparse feature matrix in scikit-learn?

I am using sklearn's DictVectorizer to construct a large, sparse feature matrix that is fed to an ElasticNet model. Elastic net (and similar linear models) works best when the predictors (columns of the feature matrix) are centered and scaled. The recommended approach is to build a Pipeline with a StandardScaler before the regressor, but that doesn't work with sparse features, as stated in the docs.
I thought of using the normalize=True flag of ElasticNet, which seems to support sparse data, but it isn't clear whether the normalization is also applied to the test data at prediction time. Does anyone know whether normalize=True applies during prediction as well? If not, is there a way to apply the same standardization to the training and test sets when dealing with sparse features?
Digging through the sklearn code, it looks like when fit_intercept=True and normalize=True, the coefficients estimated on the normalized data are projected back to the original scale of the data. This is similar to the way glmnet in R handles standardization. The relevant code is the _set_intercept method of LinearModel, see https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/base.py#L158. So predictions on unseen data use coefficients on the original scale, i.e., normalize=True is safe to use.
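If you would rather standardize explicitly (the normalize flag has since been deprecated in newer scikit-learn releases), a sketch of an alternative: scale without centering inside a Pipeline, which preserves sparsity and guarantees the same transform at fit and predict time. The dict records here are made up for illustration:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import ElasticNet
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # with_mean=False divides by the standard deviation only; centering
    # would densify the sparse matrix.
    model = make_pipeline(
        DictVectorizer(),
        StandardScaler(with_mean=False),
        ElasticNet(),
    )
    records = [{"a": 1.0, "b": 2.0}, {"a": 3.0}, {"b": 1.0}]
    model.fit(records, [0.5, 1.5, 1.0])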
