Recursive Feature Elimination on Categorical Data in sklearn? - scikit-learn

I have a dataset containing 8 parameters (4 continuous, 4 categorical) and I am trying to eliminate features using the RFECV class in scikit-learn.
This is the code I am using:
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(n_splits=2),
              scoring='accuracy')
rfecv.fit(X, y)
As I also have categorical data, I converted it to dummy variables using dmatrices (patsy).
I want to try different classification models on the data after feature selection, along with SVC, to improve the model.
I ran RFE after transforming the data, and I think I am doing it wrong.
Do we run RFECV before transforming the categorical data, or after?
I can't find any clear indication in the documentation.
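For concreteness, the patsy step mentioned above typically looks something like the following; the DataFrame df and the column names (label, cat1-cat4, num1-num4) are placeholders, not taken from the question.
from patsy import dmatrices
# "0 +" drops the intercept column so only real features are passed to RFECV;
# each C(...) term expands into dummy columns for its levels.
y_dm, X_dm = dmatrices(
    "label ~ 0 + C(cat1) + C(cat2) + C(cat3) + C(cat4) + num1 + num2 + num3 + num4",
    data=df, return_type="dataframe")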

It depends on whether you want to select individual values of the categorical variable or the whole variable.
You are currently selecting single settings (aka levels) of the categorical variable.
To select whole variables, you would probably need to do a bit of hackery, defining your own estimator based on SVC.
You could do make_pipeline(OneHotEncoder(categorical_features), SVC()), but then you need to set the coef_ of the pipeline to something that reflects the input shape.
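As a rough sketch of that hackery (the GroupedSVC class and its categorical_features argument are my own names, not an sklearn API), the wrapper below one-hot encodes the given columns internally, fits a linear SVC, and sums the absolute coefficients of each dummy column back onto its original feature, so RFECV sees one coef_ entry per whole variable.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

class GroupedSVC(BaseEstimator, ClassifierMixin):
    # Hypothetical wrapper, not part of sklearn; assumes at least one categorical column.
    def __init__(self, categorical_features=()):
        self.categorical_features = categorical_features

    def fit(self, X, y):
        X = np.asarray(X)
        cat = list(self.categorical_features)
        num = [i for i in range(X.shape[1]) if i not in cat]
        self.pre_ = ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), cat)],
            remainder="passthrough", sparse_threshold=0)
        Xt = self.pre_.fit_transform(X)
        self.svc_ = SVC(kernel="linear").fit(Xt, y)
        # Encoded layout: dummy columns for `cat` first (in order), then the `num` columns.
        sizes = [len(c) for c in self.pre_.named_transformers_["cat"].categories_]
        groups = np.concatenate([np.repeat(cat, sizes), num]).astype(int)
        # One weight per original feature: sum of |coef| over that feature's encoded columns.
        col_w = np.abs(self.svc_.coef_).sum(axis=0)
        w = np.zeros(X.shape[1])
        np.add.at(w, groups, col_w)
        self.coef_ = w.reshape(1, -1)
        self.classes_ = self.svc_.classes_
        return self

    def predict(self, X):
        return self.svc_.predict(self.pre_.transform(np.asarray(X)))
With 8 columns you would pass something like categorical_features=[4, 5, 6, 7] (whichever positions hold the categorical variables). Note that RFECV refits the estimator on progressively smaller column subsets, so a fully general version would also need to remap those indices as columns are eliminated; that bookkeeping is left out of this sketch.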

Related

Is it possible to get the number of rows of the training set from a LGBMClassifier?

I have trained a model using lightgbm.sklearn.LGBMClassifier from the lightgbm package. I can find out the number of columns and the column names of the training data from the model, but I have not found a way to find the number of rows of the training data. Is it possible to do so? The best solution would be to obtain the training data from the model, but I have not come across anything like that.
# This gives the number of the columns the model is trained with
lgbm_model.n_features_
# Any way to find out the row number of the training data as well?
lgbm_model.n_instances_ # does not exist!
The tree structure of a LightGBM model includes information about how many records from the training data would fall into each node in the tree if that node were a leaf node. In LightGBM's code, this value is called internal_count.
Since all data matches the root node of each tree, in most situations you can use that information to figure out, given a LightGBM model, how many instances were in the training data.
Consider the following example, using lightgbm==3.3.2 and Python 3.8.8.
import lightgbm as lgb
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=1234, centers=[[-4, -4], [-4, 4]])
clf = lgb.LGBMClassifier(num_iterations=10, subsample=0.5)
clf.fit(X, y)
num_data = clf.booster_.dump_model()["tree_info"][0]["tree_structure"]["internal_count"]
print(num_data)
# 1234
This will work in most cases. There are two special circumstances where this number could be misleading as an answer to the question "how much data was used to train this model":
if you set bagging_fraction<1.0, then at each iteration LightGBM will only use a fraction of the training data to evaluate splits (see the LightGBM docs for details on bagging_fraction)
if you use "training continuation", where you take an existing model and perform additional boosting rounds, and you use a different training set for those additional boosting rounds, then "how much data was used to train this model" will have a complicated answer that depends on which range of boosting rounds you're referring to by "this model"

Categorical variables in recursive feature elimination with random forest

I am trying to use recursive feature elimination with a random forest to find the optimal features. However, one thing I am confused about is what I should do with categorical variables. Most of the time people use a one-hot encoder for the categorical variables. But if I do one-hot encoding, how can I know which feature is important and which is not? Because after one-hot encoding, one feature may become multiple features.
My current way is to use a label encoder for all the categorical variables, which means I label all the categorical variables as integers (sketched below), and then use the following code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
rfc = RandomForestClassifier(random_state=101)
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(10), scoring='accuracy')
rfecv.fit(X, target)
One feature is a county name with 44 different values; I am not sure if this is the right way to do it.
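A minimal sketch of that encoding step, assuming X is a pandas DataFrame and "county" is a hypothetical column name (note that sklearn's LabelEncoder is meant for target labels; OrdinalEncoder is its counterpart for feature columns):
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ["county"]  # hypothetical list of categorical column names
# Each categorical column stays a single integer column (44 county names -> 0..43),
# so RFECV above ranks the whole variable rather than 44 dummy columns.
X[cat_cols] = OrdinalEncoder().fit_transform(X[cat_cols])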

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The dataset was provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here.
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the built-in breast cancer dataset in sklearn, which is very similar to my dataset in that both are binary classification problems, and there I get a mean accuracy of 95%. So I now think the problem might be my dataset. Can I get some help on how to pre-process my data, or any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical (text) data into numbers, which our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary widely in magnitude, units and range. Since most machine learning algorithms use the Euclidean distance between data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, which means transforming your data so that it fits within a specific range, like 0–100 or 0–1. We will use the StandardScaler class from the scikit-learn library.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing the right model
You might also want to choose an appropriate model. You can't just throw a neural net (or any single model) at every problem; that's the no free lunch theorem. To compare candidates you could use k-fold cross-validation, AIC or BIC, for example as sketched below.
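A minimal sketch of such a comparison, assuming X and Y are the full feature matrix and the encoded labels from the steps above (the candidate models are just examples):
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = [("decision tree", DecisionTreeClassifier(random_state=0)),
              ("logistic regression", LogisticRegression(max_iter=1000)),
              ("SVC", SVC())]
for name, model in candidates:
    # Scale inside the pipeline so each fold is scaled on its own training split.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, Y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")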

Testing on categorical variables using sklearn

I'm trying to generate and test a model on categorical variables using scikit-learn. I'm interested in using the one-hot encoder to encode these categorical variables in an sklearn pipeline, after imputation of the data and before a random forest.
estimator = Pipeline([
("imputer", Imputer(missing_values='NaN', strategy="median",axis=0)),
("dummy", OneHotEncoder(categorical_features=np.where(mask), handle_unknown = 'ignore')),
("forest", RF())])
Training works fine, but the trouble comes when I try to test the generated model on new data. The categorical variables possible for this problem are not bounded, and not all possible categories show up in the training dataset. So there might be test data containing categories the model has never seen before, resulting in crashes in the prediction process due to mismatched dimensions.
As a concrete example, say one of the features I'm training on is fruit_name. The model trains on many examples of various fruits, including bananas, apples, and oranges. fruit_name is one-hot encoded in the pipeline. However, say I have test data that contains a fruit_name the model has never seen before, like kiwi. Then the test data will have an extra column compared to the training data. Alternatively, say the test data doesn't actually contain bananas, apples or oranges. Then it will have fewer columns than the training data. Either way, model testing will crash.
How do I handle this issue with categorical variables using the sklearn pipeline?
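For what it's worth, here is a hedged sketch (the fruit data is made up to mirror the example above, and it uses the newer SimpleImputer/ColumnTransformer API rather than the deprecated Imputer and categorical_features) showing that the encoded width is fixed at fit time: with handle_unknown='ignore', an unseen category becomes all-zero dummy columns at predict time, so the dimensions still match.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"fruit_name": ["banana", "apple", "orange", "apple"],
                        "weight": [120.0, 80.0, None, 95.0]})
y_train = [0, 1, 1, 0]

estimator = Pipeline([
    ("prep", ColumnTransformer([
        ("dummy", OneHotEncoder(handle_unknown="ignore"), ["fruit_name"]),
        ("imputer", SimpleImputer(strategy="median"), ["weight"]),
    ])),
    ("forest", RandomForestClassifier(random_state=0)),
])
estimator.fit(X_train, y_train)

# "kiwi" was never seen during fit; it is encoded as all zeros, so the column
# count still matches what the forest was trained on and predict() works.
X_test = pd.DataFrame({"fruit_name": ["kiwi"], "weight": [100.0]})
print(estimator.predict(X_test))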

advanced feature extraction for cross-validation using sklearn

Given a sample dataset with 1000 samples of data, suppose I would like to preprocess the data in order to obtain 10000 rows, so each original row leads to 10 new samples. When training my model, I would also like to be able to perform cross-validation.
The scoring function I have uses the original data to compute the score, so I would like cross-validation scoring to work on the original data as well, rather than on the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to correctly split the data according to the original samples.
What I thought about doing:
Create a custom feature extractor to extract features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, say, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model given a set of selected parameters.
Is there a better method for what I am trying to accomplish?
I am asking this in connection with a competition currently running on Kaggle.
Maybe you can use stratified cross-validation (e.g. StratifiedKFold or StratifiedShuffleSplit) on the expanded samples, using the original sample index as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.
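A rough sketch of that suggestion (the 10-copies-per-row expansion and the orig_idx bookkeeping array are assumptions about the setup described in the question):
import numpy as np
from sklearn.model_selection import StratifiedKFold

n_original, n_copies = 1000, 10
# orig_idx[i] = index of the original sample that generated row i of the expanded data.
orig_idx = np.repeat(np.arange(n_original), n_copies)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_rows, test_rows in skf.split(np.zeros(len(orig_idx)), orig_idx):
    # Fit the RandomForestClassifier on the generated rows in train_rows, then
    # apply the custom scorer only to the original samples behind test_rows.
    pass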
