I want to select the most important features out of my data containing categorical and numerical features. I tried SelectFromModel and RFE.
As a preprocessing step, I have transformed my categorical features via one-hot encoding (OHE) into multiple features (e.g. weekday -> Monday, Tuesday, Wednesday, ...).
The above-mentioned methods now select only parts of a categorical feature (e.g. only Monday). Is there any way, other than brute-forcing all combinations of categorical features, to either select entire categorical features or drop them completely?
There are many other ways to do this.
A simple solution is using Label Encoder, which maps your categories to integers. For example, 'Monday' turns into 0, 'Tuesday' turns into 1, 'Wednesday' into 2, etc.
The problem with this is that it will trick your random forest into thinking that Tuesday (1) is greater than Monday (0), which may or may not be a reasonable thing to assume. Label Encoder is therefore only recommended when your categories have an intrinsic ranking (such as bad < neutral < good < awesome).
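A minimal sketch of the idea (the weekday column is made up; strictly speaking, scikit-learn's LabelEncoder is intended for target labels, and OrdinalEncoder does the same thing column-wise for features):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frame with a hypothetical 'weekday' column
df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Wednesday", "Monday"]})
le = LabelEncoder()
# LabelEncoder assigns integers in sorted (here alphabetical) order of the labels
df["weekday_encoded"] = le.fit_transform(df["weekday"])
print(df)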
Another approach would be to group by the categorical column and get the mean of the target column. You can then assign numeric values to each category depending on how they relate to the target. If you want more details on this second option, post a more detailed question.
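A rough sketch of that grouping idea with pandas (the column names and numbers are made up):

import pandas as pd

# hypothetical data: a categorical column and a numeric target
df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Tue", "Wed"],
                   "sales":   [10, 20, 12, 22, 5]})
# replace each category with the mean of the target for that category
means = df.groupby("weekday")["sales"].mean()
df["weekday_encoded"] = df["weekday"].map(means)
print(df)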
Finally, I recommend using catboost. It is a gradient boosting algorithm that can handle categorical data natively.
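A minimal sketch with toy data (column names made up); CatBoost only needs to be told which columns are categorical via cat_features:

import pandas as pd
from catboost import CatBoostClassifier

# toy data; 'weekday' is passed as a raw categorical column, no one-hot encoding needed
X = pd.DataFrame({"weekday": ["Mon", "Tue", "Wed", "Mon", "Tue", "Wed"],
                  "amount":  [10, 12, 9, 11, 13, 8]})
y = [0, 1, 0, 1, 1, 0]

model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(X, y, cat_features=["weekday"])
# importances are reported per original column, not per dummy
print(model.get_feature_importance(prettified=True))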
If you want random forests or an RFE algorithm to consider the categorical variable as a whole, then one-hot encoding is not the way forward.
You can encode the variable as integers using the OrdinalEncoder transformers available in the open-source libraries Scikit-learn, Category_encoders or Feature-engine.
Random forests should be able to capture non-linear relationships, hence it is not super important if the Tuesday > Monday relationship is lost in the encoding.
You can, of course, replace the categories with integers of your choosing using a mapping, for example df[my_var].map({'Monday': 1, 'Tuesday': 2, ...}). If you do this, you can keep those intrinsic ordinal relationships.
There are also alternative encodings that create a monotonic relationship between your categorical variable and the target, like mean encoding or ordinal encoding following the target mean. These are available in the libraries Feature-engine and Category encoders, although for tree-based models monotonic encodings are not always helpful. The advantage of these encodings is that they return one variable per categorical variable, instead of several dummy variables, which suits your problem.
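For example, a hedged sketch with the Category encoders library (the data is made up; Feature-engine's MeanEncoder works along similar lines):

import category_encoders as ce
import pandas as pd

# hypothetical data
X = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Wed", "Tue"]})
y = [0, 1, 0, 1, 1]

# one numeric column comes back per categorical variable, no dummies
encoder = ce.TargetEncoder(cols=["weekday"])
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)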
Creating the variables you mention, IsMonday, IsTuesday and so on, is also a suitable option if you're willing to ignore the other labels in the variable. You can always select the most predictive variables with RFE and then train the final model using the selected ones plus the ignored categories (i.e. IsWed, IsThu, etc.). If the model uses regularization, this should not influence the final performance.
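Putting the ordinal-encoding route together, a rough sketch with made-up data: each categorical variable stays a single column, which RFE then keeps or drops as a whole:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.preprocessing import OrdinalEncoder

# made-up frame: one categorical column and two numeric columns
X = pd.DataFrame({"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Mon"],
                  "amount":  [10, 12, 9, 11, 13, 8],
                  "items":   [1, 3, 2, 2, 4, 1]})
y = [0, 1, 0, 1, 1, 0]

X_enc = X.copy()
X_enc[["weekday"]] = OrdinalEncoder().fit_transform(X[["weekday"]])

# 'weekday' is now a single feature, so it is selected or eliminated as a whole
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=2)
rfe.fit(X_enc, y)
print(dict(zip(X_enc.columns, rfe.support_)))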
Just set the estimator of the selector object to be a pipeline (and/or ColumnTransformer) that includes the OneHotEncoder.
Assume that in a machine learning problem, there are several categorical features in the dataset.
One common way to handle categorical features is one-hot encoding. However, in this example, the authors applied OrdinalEncoder to the categorical features before fitting the model and getting feature importances.
I would like to ask if sklearn algorithms, in general, treat OrdinalEncoded features as continuous or categorical features.
If sklearn models treat OrdinalEncoded features as continuous features, is it the correct way to handle categorical features?
In the end, ordinal-encoded features are just numbers (floats), so, as CutePoison said, they are treated as continuous.
Is ordinal encoding the correct way to handle them? It depends; you should ask yourself whether the order of the data is important.
If it is important, you can use OrdinalEncoder. A typical example is the rating of a movie: ["disgusting", "bad", "normal", "good", "super"]. As you can see, "bad" is "smaller" than "normal", so the order matters.
However, for other categorical data, like professions (["police", "teacher", "lawyer", "engineer"]), there is no meaningful order. You can't say that "police" is "smaller" than "lawyer", for example. In that case, you have to use OneHotEncoder.
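A small sketch of both cases with scikit-learn's encoders, reusing the category lists above:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# ordered categories: pass the order explicitly so the integers reflect it
ratings = [["bad"], ["good"], ["normal"]]
ord_enc = OrdinalEncoder(categories=[["disgusting", "bad", "normal", "good", "super"]])
print(ord_enc.fit_transform(ratings))            # [[1.], [3.], [2.]]

# unordered categories: one binary column per profession, no implied ranking
professions = [["police"], ["teacher"], ["lawyer"], ["engineer"]]
print(OneHotEncoder().fit_transform(professions).toarray())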
So, in conclusion, it depends on the nature of your categorical data.
It wouldn't make much sense for the encoder to convert a categorical feature into another categorical feature, when the issue you are trying to solve is that you have a categorical feature in the first place.
If you read the parameters of the encoder, you have:
dtype: default np.float64
Desired dtype of output.
i.e. the output is a float by default.
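A quick way to check this (the sample categories are made up):

from sklearn.preprocessing import OrdinalEncoder

out = OrdinalEncoder().fit_transform([["police"], ["teacher"], ["lawyer"]])
print(out.dtype)   # float64 by default, so downstream models see ordinary numbers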
Is it the correct way? As with everything in machine learning, it depends on your application.
I am trying to use recursive feature elimination with a random forest to find the optimal features. However, one thing I am confused about is what I should do with categorical variables. Most of the time, people do one-hot encoding for the categorical variables. But if I do one-hot encoding, how can I know which feature is important and which is not? Because after one-hot encoding, one feature may become multiple features.
My current approach is label encoding all the categorical variables, which means I encoded all the categorical variables as integers, and then running the following code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

rfc = RandomForestClassifier(random_state=101)
rfecv = RFECV(estimator=rfc, step=1, cv=StratifiedKFold(10), scoring='accuracy')
rfecv.fit(X, target)
One feature contains 44 different county names; I am not sure if this is the right way to handle it.
In the MLlib version of Random Forest there was a possibility to specify the columns with nominal features (numerical but still categorical variables) with the parameter categoricalFeaturesInfo.
What about the ML Random Forest? In the user guide there is an example that uses VectorIndexer to convert the categorical features into a vector as well, but it says "Automatically identify categorical features, and index them".
In another discussion of the same problem, I found that numerical indexes are treated as continuous features anyway in random forests, and that one-hot encoding is recommended to avoid this, which does not seem to make sense for this algorithm, especially given the official example mentioned above!
I also noticed that when there are a lot of categories (>1000) in the categorical column, once they are indexed with StringIndexer, the random forest algorithm asks me to set the maxBins parameter, which is supposed to be used with continuous features. Does it mean that features with more categories than the number of bins will be treated as continuous, as specified in the official example, and so StringIndexer is OK for my categorical column? Or does it mean that the whole column of numerical but still nominal features will be bucketized under the assumption that the variables are continuous?
In another discussion of the same problem, I found that numerical indexes are treated as continuous features anyway in random forests,
This is actually incorrect. Tree models (including RandomForest) depend on column metadata to distinguish between categorical and numerical variables. Metadata can be provided by ML transformers (like StringIndexer or VectorIndexer) or added manually. The old mllib RDD-based API, which is used internally by ml models, uses categoricalFeaturesInfo Map for the same purpose.
The current API just takes the metadata and converts it to the format expected by categoricalFeaturesInfo.
OneHotEncoding is required only for linear models, and recommended, although not required, for the multinomial naive Bayes classifier.
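A hedged sketch of the ml Pipeline approach in PySpark (the DataFrame df and its column names are hypothetical); StringIndexer already attaches nominal metadata to its output column, and VectorIndexer is included as in the user guide example:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer

# 'df' is a hypothetical DataFrame with a string column 'category',
# a numeric column 'amount' and a numeric 'label' column
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
assembler = VectorAssembler(inputCols=["category_idx", "amount"], outputCol="raw_features")
# columns with at most maxCategories distinct values get categorical metadata
vector_indexer = VectorIndexer(inputCol="raw_features", outputCol="features", maxCategories=50)
# for high-cardinality columns, maxBins must be at least the number of categories
rf = RandomForestClassifier(labelCol="label", featuresCol="features", maxBins=64)

model = Pipeline(stages=[indexer, assembler, vector_indexer, rf]).fit(df)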
I am new to machine learning and I am working on a classification problem with Categorical (nominal) data. I have tried applying BayesNet and a couple of Trees and Rules classification algorithms to the raw data. I am able to achieve an AUC of 0.85.
I further want to improve the AUC by pre-processing or transforming the data. However, since the data is categorical, I don't think that log transforms, addition, multiplication, etc. of different columns will work here.
Can somebody list the most common transformations applied to categorical datasets? (I tried one-hot encoding but it takes a lot of memory!)
Categorical data is, in my experience, best dealt with via one-hot encoding (i.e. converting to a binary vector), as you've mentioned. If memory is an issue, it may be worthwhile using an online classification algorithm and generating the modified vectors on the fly.
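On the memory point specifically: scikit-learn's OneHotEncoder returns a scipy sparse matrix by default, which stores only the non-zero entries (toy categories below):

from sklearn.preprocessing import OneHotEncoder

# toy high-cardinality column; only the ones are stored, not the zeros
counties = [["Kent"], ["Essex"], ["Kent"], ["Surrey"]]
X = OneHotEncoder().fit_transform(counties)
print(type(X), X.shape)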
Apart from this, if the categories represent ranges of values (such as age, height or income bands), it may be possible to treat the centre of each range (or some appropriate mean, if there is an intra-label distribution) as a real number.
If you were applying clustering you could also treat the categorical labels as points on an axis (1,2,3,4,5 etc), scaled appropriately to the other features.
I'm using the SVM classifier in the machine learning scikit-learn package for python.
My features are integers. When I call the fit function, I get the user warning "Scaler assumes floating point values as input, got int32", the SVM returns its prediction, I calculate the confusion matrix (I have 2 classes) and the prediction accuracy.
I've tried to avoid the user warning, so I saved the features as floats. Indeed, the warning disappeared, but I got a completely different confusion matrix and prediction accuracy (surprisingly, much less accurate).
Does someone know why it happens? What is preferable, should I send the features as float or integers?
Thanks!
You should convert them to floats, but the way to do it depends on what the integer features actually represent.
What is the meaning of your integers? Are they category membership indicators (for instance: 1 == sport, 2 == business, 3 == media, 4 == people...) or numerical measures with an order relationship (3 is larger than 2, which in turn is larger than 1)? You cannot say that "people" is larger than "media", for instance; it is meaningless and would confuse the machine learning algorithm to give it this assumption.
Categorical features should hence be transformed by exploding each feature into several boolean features (with value 0.0 or 1.0), one per possible category. Have a look at the DictVectorizer class in scikit-learn to better understand what I mean by categorical features.
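A small sketch of what DictVectorizer does (the field names are made up):

from sklearn.feature_extraction import DictVectorizer

# made-up records mixing a categorical field and a numerical one
rows = [{"section": "sport", "length": 120.0},
        {"section": "business", "length": 80.0}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)
print(vec.get_feature_names_out())   # string values become one 0/1 column per category
print(X)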
If they are numerical values, just convert them to floats and maybe use the Scaler to get them loosely into the range [-1, 1]. If they span several orders of magnitude (e.g. counts of word occurrences), then taking the logarithm of the counts might yield better results. There is more documentation on feature preprocessing, with examples, in this section of the documentation: http://scikit-learn.org/stable/modules/preprocessing.html
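For the numerical case, a rough sketch with made-up counts (in current scikit-learn the old Scaler is called StandardScaler; MaxAbsScaler is another option when you want values within [-1, 1]):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# made-up word counts spanning several orders of magnitude
counts = np.array([[1.0, 1500.0], [3.0, 20.0], [10.0, 300000.0]])
X = MaxAbsScaler().fit_transform(np.log1p(counts))
print(X)   # the log compresses the dynamic range, the scaler keeps values in [-1, 1]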
Edit: also read this guide that has many more details for features representation and preprocessing: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf