How to provide weighted eval set to XGBClassifier.fit()? - scikit-learn

From the sklearn-style API of XGBClassifier, we can provide eval examples for early-stopping.
eval_set (list, optional) – A list of (X, y) pairs to use as a validation set for early-stopping
However, the format only mentions a pair of features and labels. So if the doc is accurate, there is no place to provide weights for these eval examples.
Am I missing anything?
If it's not achievable in the sklearn-style API, is it supported in the original (i.e. non-sklearn) xgboost API? A short example would be nice, since I have never used that version of the API.

As of a few weeks ago, there is a new parameter for the fit method, sample_weight_eval_set, that allows you to do exactly this. It takes a list of weight variables, i.e. one per evaluation set. I don't think this feature has made it into a stable release yet, but it is available right now if you compile xgboost from source.
https://github.com/dmlc/xgboost/blob/b018ef104f0c24efaedfbc896986ad3ed1b66774/python-package/xgboost/sklearn.py#L235

EDIT - UPDATED per conversation in comments
Given that you have a target variable representing real-valued gain/loss values which you would like to classify as "gain" or "loss", and you would like the classifier's validation set to weight the gains/losses with the largest absolute values most heavily, here are two possible approaches:
1. Create a custom classifier which is just an XGBRegressor fed to a threshold, where the real-valued regression predictions are converted to 1/0 or "gain"/"loss" classifications. The .fit() method of this classifier would just call .fit() of the XGBRegressor, while the .predict() method of this classifier would call .predict() of the regressor and then return the thresholded category predictions.
2. You mentioned you would like to try weighting the treatment of the records in your validation set, but there is no option for this in xgboost. The way to implement this would be to write a custom eval metric. However, you pointed out that eval_metric must be able to return a score for a single label/pred record at a time, so it couldn't accept all your row values and perform the weighting in the eval metric. The solution you mentioned in your comment was "create a callable which has a ref to all validation examples, pass the indices (instead of labels and scores) into eval_set, use the indices to fetch labels and scores from within the callable and return metric for each validation examples." This should also work.
I would tend to prefer option 1 as more straightforward, but trying two different approaches and comparing results is generally a good idea if you have the time, so I'd be interested to hear how these turn out for you.
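Option 1 could be sketched roughly as follows. The class name ThresholdedRegressorClassifier is hypothetical, and LeastSquaresRegressor is only a stand-in so the sketch runs without xgboost; the assumption is that any regressor with fit() and predict(), such as XGBRegressor, could be dropped in instead.

```python
import numpy as np


class ThresholdedRegressorClassifier:
    """Hypothetical wrapper: fit a regressor, then threshold its predictions."""

    def __init__(self, regressor, threshold=0.0):
        self.regressor = regressor
        self.threshold = threshold

    def fit(self, X, y, **fit_params):
        # Delegate training (and any extra fit arguments) to the regressor.
        self.regressor.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        # Convert real-valued predictions to 1 ("gain") or 0 ("loss").
        scores = np.asarray(self.regressor.predict(X))
        return (scores > self.threshold).astype(int)


class LeastSquaresRegressor:
    """Stand-in regressor (ordinary least squares with an intercept)."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


# Toy data: the target is a real-valued gain/loss.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
gains = np.array([-20.0, -10.0, 10.0, 20.0])

clf = ThresholdedRegressorClassifier(LeastSquaresRegressor()).fit(X, gains)
labels = clf.predict(np.array([[-1.5], [0.5]]))  # 0 = "loss", 1 = "gain"
```

Because fit() forwards extra keyword arguments, validation-related arguments of the wrapped regressor can be passed straight through.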

Related

Keras: Get predict result and intermediate layer value at the same time

I want to get the results of the last layer which is the category label and the intermediate layer value at the same time, for example, after I call model.predict().
Is it possible?
The accepted answer to this post should point you in the right direction:
multiple-outputs-in-keras
You will likely need to use Keras' functional API for this as you can specify multiple inputs and outputs.
Keras Functional API Guide
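As a rough sketch of the functional-API approach, assuming TensorFlow's bundled Keras is installed: a model can list both the final layer and an intermediate layer as outputs, so a single predict() call returns both. The layer sizes and the names "hidden" and "label" are made up for illustration.

```python
import numpy as np
from tensorflow import keras

# Build a small network with the functional API.
inputs = keras.Input(shape=(8,))
hidden = keras.layers.Dense(4, activation="relu", name="hidden")(inputs)
outputs = keras.layers.Dense(3, activation="softmax", name="label")(hidden)

# Declare two outputs: the class probabilities and the intermediate layer.
model = keras.Model(inputs=inputs, outputs=[outputs, hidden])

x = np.random.rand(5, 8).astype("float32")

# One predict() call returns one array per declared output.
probs, hidden_values = model.predict(x, verbose=0)
predicted_labels = probs.argmax(axis=1)
```

The model here is untrained, so the values are meaningless; the point is only the shape of the call.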

sklearn Group K-fold and groups parameters in other cross-validators similarity

Maybe it is obvious but I would like to be sure of what I am doing:
I understand that Group K-fold, as implemented in sklearn, is a variation of k-fold cross-validation which ensures that data belonging to the same group will not be represented in the train and test sets at the same time.
That is what I also need. However, before I discovered the aforementioned implementation of group k-fold, as I was trying to calculate the validation curve for a problem, I noticed the following parameter (groups):
validation_curve(estimator, X, y, param_name, param_range, groups=None, cv=None...)
According to the documentation, if I provide a list of size [n_samples] containing the group label of each sample, then the train/test dataset splitting will be done according to these labels.
And here comes the question. Since such a convenient parameter is provided, why, according to my searches, is everyone in need of group k-fold validation first using sklearn.model_selection.GroupKFold?
Am I missing something here?
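For context, a small sketch of how the two pieces are usually combined. The scikit-learn documentation describes groups as only used in conjunction with a "Group" cv instance such as GroupKFold, so the sketch passes both; the synthetic data and the choice of LogisticRegression with parameter C are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, validation_curve

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(6), 10)  # 6 groups of 10 samples each

cv = GroupKFold(n_splits=3)

# Check that no group appears on both sides of any split.
no_overlap = all(
    set(groups[train]).isdisjoint(groups[test])
    for train, test in cv.split(X, y, groups)
)

# The group labels are forwarded to the group-aware splitter.
train_scores, test_scores = validation_curve(
    LogisticRegression(),
    X,
    y,
    param_name="C",
    param_range=[0.1, 1.0, 10.0],
    groups=groups,
    cv=cv,
)
```

Read this way, groups carries the labels and GroupKFold is the splitter that acts on them, which may be why the two tend to appear together.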

kmeans clustering transform method specifying centroids

In the scikit-learn kmeans source code, there is an optional argument y that can be specified (transform(X[, y])); however when I examined the source code for transform, it seems that nowhere does it deal with y in the case that it is specified. What is the purpose of this optional argument (it is not clear in the documentation either)?
As an addendum; I was wondering if there was any way to specify the centroids in the transform function if they're already computed previously. (Or if there was any other function to do this in scikit-learn).
Centroid specification
You could just overwrite kmeans_object.cluster_centers_ with your own centroids. But it might be better to pass these centers via init and run some iterations.
See the available attributes in the docs.
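Both suggestions could look roughly like this. The data and centroid values are made up, and the sketch assumes the estimator is fitted once before cluster_centers_ is overwritten, since transform() expects a fitted model.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
my_centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # computed elsewhere

# Variant A: seed k-means with the known centroids and let it iterate.
seeded = KMeans(n_clusters=2, init=my_centroids, n_init=1).fit(X)

# Variant B: fit once so the estimator is in a fitted state,
# then overwrite the learned centers with the known ones.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km.cluster_centers_ = my_centroids.copy()

distances = km.transform(X)  # distance of each sample to each custom centroid
labels = km.predict(X)       # index of the nearest custom centroid
```

Variant A lets the centers move toward the data; variant B keeps them exactly as given.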
To answer your first question about the seemingly pointless argument y: you are correct. In many cases, Scikit-Learn allows users to pass a y argument that doesn't actually affect the outcome of the method.
As explained in their documentation:
y might be ignored in the case of unsupervised learning. However, to
make it possible to use the estimator as part of a pipeline that can
mix both supervised and unsupervised transformers, even unsupervised
estimators need to accept a y=None keyword argument in the second
position that is just ignored by the estimator. For the same reason,
fit_predict, fit_transform, score and partial_fit methods need to
accept a y argument in the second place if they are implemented.
So it's all to make the code easier to write. Imagine that you have a pipeline that looks like this:
step 0: some normalization
step 1: K-means to transform the data in another space
step 2: classification step
Step 1 obviously doesn't need y to work, but if you have to write the code to make the pipeline apply all of these steps, it'll be easier to simply pass X, y into all transformers, rather than having to worry about whether each individual transformer takes a y or not.
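The three-step pipeline described above can be sketched with scikit-learn's Pipeline. The concrete estimators (StandardScaler, KMeans with 4 clusters, LogisticRegression) and the synthetic data are illustrative choices, not something the answer prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

pipe = Pipeline([
    ("scale", StandardScaler()),                                  # step 0
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),  # step 1
    ("clf", LogisticRegression()),                                # step 2
])

# Pipeline hands (X, y) to every step; KMeans accepts y and ignores it.
pipe.fit(X, y)
predictions = pipe.predict(X)
```

If KMeans rejected a y argument, the pipeline would need special handling for that one step.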

scikit-learn clustering: predict(X) vs. fit_predict(X)

In scikit-learn, some clustering algorithms have both predict(X) and fit_predict(X) methods, like KMeans and MeanShift, while others only have the latter, like SpectralClustering. According to the doc:
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
predict(X): Predict the closest cluster each sample in X belongs to.
I don't really understand the difference between the two, they seem equivalent to me.
In order to use 'predict()' you must call 'fit()' first. So calling 'fit()' and then 'predict()' on the same data is definitely the same as calling 'fit_predict()'. However, calling only 'fit()' is useful when you need access to the parameters of the fitted model, whereas 'fit_predict()' just gives you the labels that result from running your model on the data.
fit_predict is usually used for unsupervised, transductive machine learning estimators.
Basically, fit_predict(x) is equivalent to fit(x).predict(x).
This might be very late to add an answer here, but someone might benefit from it in the future.
The way I understand why kmeans has predict() while dbscan only has fit_predict() is this:
In kmeans you get centroids based on the number of clusters considered. So once you have fitted the model on your datapoints using fit(), you can use predict() to assign a new single datapoint to a specific cluster.
In dbscan you don't have centroids; clusters are formed based on the min_samples and eps (the max distance between two points for them to be considered neighbors) you define. The algorithm returns cluster labels for all the datapoints it was fitted on. This behavior explains why there is no predict() method to label a single new datapoint. The difference between fit() and fit_predict() was already explained by another user.
Another density-based clustering algorithm, hdbscan, gives us the option to predict using approximate_predict(). It's worth exploring.
Again, this is my understanding based on the source code I explored. Any experts are welcome to point out differences.
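A small sketch of the points made in the answers above, using synthetic, well-separated blobs and arbitrary parameter values: for KMeans, fit_predict(X) and fit(X).predict(X) give the same labels, predict() can label unseen points, and DBSCAN exposes no predict() at all.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

# fit_predict in one call versus fit followed by predict on the same data.
labels_a = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels_b = km.predict(X)
same_labels = np.array_equal(labels_a, labels_b)

# A fitted KMeans can assign a point it has never seen to a cluster.
new_point_label = km.predict(np.array([[5.1, 4.9]]))

# DBSCAN only labels the data it was fitted on.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
dbscan_has_predict = hasattr(DBSCAN, "predict")
```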

How am I supposed to use RandomizedLogisticRegression in Scikit-learn?

I simply have failed to understand the documentation for this class.
I can fit data using it, and get the scores for features, but is this all this class is supposed to do?
I can't see how I can use it to actually perform regression using the model that was fit. The example in the documentation above is simply creating an instance of the class, so I can't see how that is supposed to help.
There are methods that perform 'transform' operation, but no mention of what kind of transform that is.
So is it possible to use this class to get actual predictions on new test data, and is it possible to use it in cross-validation to compare performance with other methods I'm using?
I've used the highest ranking features in other classifiers, but I'm not sure if more than that is possible with this classifier.
Update: I've found the use for fit_transform under feature selection part of the documentation:
When the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to select the non-zero coefficient
Unless I get an answer that says I'm wrong, I'll assume that this classifier indeed does not do prediction. I'll wait before I answer my own question.
Randomized LR is supposed to be a feature selection method, not a classifier in and of itself. Its API matches that of a standard scikit-learn transformer:
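from sklearn.linear_model import RandomizedLogisticRegression  # scikit-learn < 0.21 only
randomlr = RandomizedLogisticRegression()
X_train = randomlr.fit_transform(X_train, y_train)  # fitting needs the labels
X_test = randomlr.transform(X_test)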
Then fit a model to X_train and do classification on X_test as usual.
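RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in 0.21, so the snippet above only runs on older versions. The sketch below keeps the same select-then-classify pattern but swaps in a different technique: SelectFromModel around an L1-penalized LinearSVC. This is plain L1-based selection, not randomized stability selection, and the data and C value are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Substitute selector: keep features with non-zero L1-penalized coefficients.
selector = SelectFromModel(LinearSVC(C=0.1, penalty="l1", dual=False))
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

# Then fit a model on the reduced features and classify as usual.
clf = LogisticRegression().fit(X_train_sel, y_train)
predictions = clf.predict(X_test_sel)
```

As with the original class, the selector does no prediction itself; a separate classifier is trained on the features it keeps.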
