how to calculate BIC, score... WITHOUT fit? - python-3.x

I know that, thanks to scikit-learn, we can easily calculate the BIC or the score for a Gaussian mixture model, as shown below.
clf.fit(data)
bic = clf.bic(data)
score = clf.score(data)
http://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html
But my question is: how do I calculate the BIC or the score WITHOUT using the fit method, when I already have the weights, means, covariances and the data?
I could set them as
clf = mixture.GaussianMixture(n_components=3, covariance_type='full')
clf.weights_ = weights_list
clf.means_ = means_list
clf.covariances_ = covariances_list
or
clf.weights_init = weights_list
clf.means_init = means_list
clf.precisions_init = np.linalg.inv(covariances_list)
But when I try to get the BIC,
bic = clf.bic(data)
I get an error message saying
sklearn.exceptions.NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
I don't want to run fit, because it will change the given weights, means and covariances.
What can I do?
Thanks.

You need to set these three attributes to pass the check_is_fitted test: weights_, means_ and precisions_cholesky_. You are already setting weights_ and means_ correctly, and to calculate precisions_cholesky_ you only need covariances_, which you have.
So just calculate it using the helper scikit-learn uses internally:
from sklearn.mixture.gaussian_mixture import _compute_precision_cholesky
precisions_cholesky = _compute_precision_cholesky(covariances_list, 'full')
Change the "full" to appropriate covariance type and then set the result to clf using
clf.precisions_cholesky_ = precisions_cholesky
Make sure the shapes of all these variables correspond correctly to your data.
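Putting it all together, a minimal sketch with made-up 3-component parameters and random 2-D data (on scikit-learn >= 0.22 the helper lives in the private module sklearn.mixture._gaussian_mixture instead):

import numpy as np
from sklearn import mixture
from sklearn.mixture.gaussian_mixture import _compute_precision_cholesky

# Hypothetical parameters for a 3-component model on 2-D data (illustration only)
weights_list = np.array([0.3, 0.3, 0.4])
means_list = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
covariances_list = np.array([np.eye(2)] * 3)

clf = mixture.GaussianMixture(n_components=3, covariance_type='full')
clf.weights_ = weights_list
clf.means_ = means_list
clf.covariances_ = covariances_list
clf.precisions_cholesky_ = _compute_precision_cholesky(covariances_list, 'full')

data = np.random.randn(100, 2)  # stand-in for your data
print(clf.bic(data))    # works without calling fit()
print(clf.score(data))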

Related

Using cross-validation to calculate feature importance "Some Questions"

I am currently working on a project. I have already selected my features and want to check their importance. I have some questions, if anyone can help me please.
1- Does it make sense if I use RandomForestClassifier with cross-validation to calculate the feature importance?
2- I tried to calculate the feature importance using the cross_validate function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html). The function returns test_score and train_score. The results I got with 10-fold cross-validation were as follows:
test_score: [0.99950158, 0.9997231, 0.9997231, 0.99994462, 0.99977848, 0.99983386, 0.99977848, 0.9997231, 0.99977847, 1.0]
train_score: [0.99998769, 0.99998154, 0.99997539, 0.99997539, 0.99998154, 0.99997539, 0.99998154, 0.99997539, 0.99998154, 0.99997539]
Can anyone explain these results? What do they indicate?
3- The cross_validate function has a parameter called scoring, which accepts different scoring values such as accuracy, balanced_accuracy and f1. What does the scoring parameter do? What do these values mean? And how should I decide which one to choose? I already read the scikit-learn documentation, but it wasn't clear to me.
Thank you.
Your question 1 is slightly out of scope here. For each run (fold) of cross-validation, you will get an array of importances for your features. How would you then combine those into a single importance per feature? A feature may look important because it scores higher on some folds, but that can vary from fold to fold.
Now, cross_validate will return the default score of the estimator used inside it, unless the scoring param is set. So if you leave scoring unset, it will use RandomForestClassifier's score() method, which returns accuracy.
(In scikit-learn, all classifiers return accuracy from score() and all regressors return the R² value.)
So for your question 2: the returned scores are accuracies per CV fold.
If you do not want accuracy and want some other score, you can set the scoring param in cross_validate.
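For example, a short sketch with synthetic data (f1 is chosen arbitrarily as the alternative metric) showing how to set scoring and, via return_estimator=True, collect the per-fold feature importances:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # stand-in data

cv_results = cross_validate(
    RandomForestClassifier(random_state=0),
    X, y,
    cv=10,
    scoring='f1',            # instead of the classifier's default accuracy
    return_estimator=True,   # keep the fitted estimator of each fold
)

print(cv_results['test_score'])   # now F1 per fold rather than accuracy

# One importance vector per fold; averaging them is one (debatable) way to combine.
importances = np.array([est.feature_importances_ for est in cv_results['estimator']])
print(importances.mean(axis=0))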

Intercept in linear regression

I am new to machine learning and I am confused about what the intercept parameter of linear regression does.
When I set fit_intercept=False, I get a .coef_ value of 287.986236; however, when I set fit_intercept=True, I get a .coef_ value of 225.81285046.
Why is there a difference? I am not sure how to interpret the results or how to compare these values.
lm = LinearRegression(fit_intercept=False).fit(REStaten_[['GROSS_SQUARE_FEET']], REStaten_['SALE_PRICE'])
lm.coef_
# 287.986236
lm = LinearRegression(fit_intercept=True).fit(REStaten_[['GROSS_SQUARE_FEET']], REStaten_['SALE_PRICE'])
lm.coef_
# 225.81285046
The slope and intercept are very important concepts in linear regression.
The slope indicates the steepness of the line, and the intercept indicates where it crosses the y-axis.
If we set fit_intercept=False, no intercept will be used in the calculations (i.e. the data is expected to be already centered).
When we fit a linear regression model to a dataset, it finds the "line of best fit" by adjusting the slope and intercept values.
You are getting different .coef_ values because you are disabling the intercept in your first attempt and enabling it in your second attempt.
Hope this helps. For more info, you can refer to the scikit-learn documentation:
scikit-learn LinearRegression
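As a small illustration with synthetic data (the REStaten_ dataframe isn't available here), you can see how forcing the line through the origin changes the coefficient:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(200, 1))                      # e.g. gross square feet
y = 150000 + 250 * X.ravel() + rng.normal(0, 20000, size=200)  # price with a non-zero baseline

lm_no_int = LinearRegression(fit_intercept=False).fit(X, y)
lm_int = LinearRegression(fit_intercept=True).fit(X, y)

# Without an intercept the line is forced through the origin, so the slope has to
# absorb the baseline price; with an intercept the slope stays close to 250.
print(lm_no_int.coef_, lm_no_int.intercept_)   # intercept_ is 0.0
print(lm_int.coef_, lm_int.intercept_)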

How to use cross-validation splitted data with RandomizedSearchCV

I'm trying to transfer my model from single run to hyper-parameter tuning using RandomizedSearchCV.
In my single run case, my data is split into train/validation/test sets.
When I run RandomizedSearchCV on my train_data with default 3-fold CV, I notice that the length of my train_input is reduced to 66% of train_data (which makes sense in a 3-fold CV...).
So I'm guessing that I should merge my initial train and validation set into a larger train set and let RandomizedSearchCV split it into train and validation sets.
Would that be the right way to go?
My question is: how can I access the remaining 33% of my train_input to feed it to my validation accuracy test function (note that my score function is running on test set)?
Thanks for your help!
Yoann
I'm not sure that my code would help here since my question is rather generic.
This is the answer that I found by going through sklearn's code: RandomizedSearchCV doesn't return the split validation data in an easy way, and I should indeed merge my initial train and validation sets into a larger train set and let RandomizedSearchCV split it into train and validation sets.
The train_data is split for CV by a cross-validator into train/validation sets (in my case, Stratified K-Folds: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html).
My estimator is defined as follows:
class DNNClassifier(BaseEstimator, ClassifierMixin):
It needs a score function to be able to evaluate the CV performance on the validation set. There is a default score function defined in the ClassifierMixin class (which returns the mean accuracy and requires a predict function to be implemented in the estimator class).
In my case, I implemented a custom score function within my estimator class.
The hyperparameter search and CV fit is done calling the fit function of RandomizedSearchCV.
RandomizedSearchCV(DNNClassifier(), param_distribs).fit(train_data)
This fit function runs the estimator's custom fit function on the train set and then the score function on the validation set.
This is done using the _fit_and_score function from the sklearn.model_selection._validation module.
So I can access the automatically split validation set (33% of my train_data input) at the end of my estimator's fit function.
I'd have preferred to access it within my estimator's fit function so that I can use it to plot validation accuracy over training steps and for early stopping (I'll keep a separate validation set for that).
I guess I could reconstruct the automatically generated validation set by looking for the missing indexes from my initial train_data (the train_data used in the estimator's fit function has 66% of the indexes of the initial train_data).
If that is something that someone has already done I'd love to hear about it!
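One possible approach (a sketch with a stand-in classifier and made-up param_distribs, not the original code): build the StratifiedKFold splits yourself and pass them to RandomizedSearchCV through its cv parameter, which accepts an iterable of (train, validation) index pairs, so the same indices stay available outside the search:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier   # stand-in for the custom DNNClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)   # stand-in for train_data

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
splits = list(skf.split(X, y))          # materialize so the indices can be reused later

param_distribs = {'n_estimators': [50, 100, 200]}   # hypothetical search space
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distribs, n_iter=3, cv=splits, random_state=0)
search.fit(X, y)

# The exact (train_idx, val_idx) pairs used by the search are now available,
# e.g. to plot validation accuracy over training or to drive early stopping.
for train_idx, val_idx in splits:
    X_val, y_val = X[val_idx], y[val_idx]
    print(len(train_idx), len(val_idx))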

How to provide weighted eval set to XGBClassifier.fit()?

From the sklearn-style API of XGBClassifier, we can provide eval examples for early-stopping.
eval_set (list, optional) – A list of (X, y) pairs to use as a
validation set for early-stopping
However, the format only mentions a pair of features and labels. So if the doc is accurate, there is no place to provide weights for these eval examples.
Am I missing anything?
If it's not achievable in the sklearn-style, is it supported in the original (i.e. non-sklearn) XGBClassifier API? A short example will be nice, since I never used that version of the API.
As of a few weeks ago, there is a new parameter for the fit method, sample_weight_eval_set, that allows you to do exactly this. It takes a list of weight arrays, one per evaluation set. I don't think this feature has made it into a stable release yet, but it is available right now if you compile xgboost from source.
https://github.com/dmlc/xgboost/blob/b018ef104f0c24efaedfbc896986ad3ed1b66774/python-package/xgboost/sklearn.py#L235
EDIT - UPDATED per conversation in comments
Given that you have a target variable representing real-valued gain/loss values which you would like to classify as "gain" or "loss", and you would like the classifier's validation set to weigh the large-absolute-value gains/losses most heavily, here are two possible approaches:
Option 1: Create a custom classifier which is just an XGBRegressor fed through a threshold, where the real-valued regression predictions are converted to 1/0 or "gain"/"loss" classifications. The .fit() method of this classifier would just call .fit() of the regressor, while its .predict() method would call the regressor's .predict() and then return the thresholded category predictions.
Option 2: You mentioned you would like to try weighting the treatment of the records in your validation set, but there is no option for this in xgboost. The way to implement it would be a custom eval metric. However, you pointed out that eval_metric must be able to return a score for a single label/pred record at a time, so it couldn't accept all your row values and perform the weighting inside the eval metric. The solution you mentioned in your comment was to "create a callable which has a ref to all validation examples, pass the indices (instead of labels and scores) into eval_set, use the indices to fetch labels and scores from within the callable and return metric for each validation examples." This should also work.
I would tend to prefer option 1 as more straightforward, but trying both approaches and comparing results is generally a good idea if you have the time, so I'm interested in how these turn out for you.
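For completeness, a short sketch of the sample_weight_eval_set parameter mentioned at the top of this answer, with synthetic data and made-up weights (depending on your xgboost version, early_stopping_rounds may need to move from fit() to the constructor):

import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

w_val = np.random.rand(len(y_val)) + 0.5   # stand-in for |gain/loss| weights per eval row

clf = XGBClassifier(n_estimators=200)
clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    sample_weight_eval_set=[w_val],   # one weight array per (X, y) pair in eval_set
    early_stopping_rounds=10,
    verbose=False,
)
print(clf.best_iteration)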

How am I supposed to use RandomizedLogisticRegression in Scikit-learn?

I simply have failed to understand the documentation for this class.
I can fit data using it and get the scores for features, but is this all the class is supposed to do?
I can't see how I can use it to actually perform regression using the model that was fit. The example in the documentation above simply creates an instance of the class, so I can't see how that is supposed to help.
There are methods that perform a 'transform' operation, but no mention of what kind of transform that is.
So is it possible to use this class to get actual predictions on new test data, and is it possible to use it in cross-fold validation to compare performance with the other methods I'm using?
I've used the highest ranking features in other classifiers, but I'm not sure if more than that is possible with this classifier.
Update: I've found the use for fit_transform under feature selection part of the documentation:
When the goal is to reduce the dimensionality of the data to use with another classifier, they expose a transform method to select the non-zero coefficient
Unless I get an answer that says I'm wrong, I'll assume that this classifier indeed does not do prediction. I'll wait before I answer my own question.
Randomized LR is supposed to be a feature selection method, not a classifier in and of itself. Its API matches that of a standard scikit-learn transformer:
randomlr = RandomizedLogisticRegression()
X_train = randomlr.fit_transform(X_train)
X_test = randomlr.transform(X_test)
Then fit a model to X_train and do classification on X_test as usual.
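For example, a sketch of that pattern wrapped in a pipeline so the selector can also be used inside cross-validation (note that RandomizedLogisticRegression was deprecated in scikit-learn 0.19 and removed in later releases, so this only runs on older versions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RandomizedLogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# The selector only picks features; the LogisticRegression at the end does the
# actual prediction, so the pipeline can be cross-validated like any classifier.
pipe = make_pipeline(RandomizedLogisticRegression(random_state=0), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5).mean())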
