How to handle categorical features with many unique values in Python/scikit-learn

In my situation, I would like to encode around 5 different columns in my dataset, but the issue is that these 5 columns have many unique values.
If I encode them using a label encoder, I add an artificial ordering that is not right, whereas if I use OHE or pd.get_dummies I end up with a lot of features that add too much sparsity to the data.
I am currently dealing with a supervised learning problem and the following are the unique values per column:
Job_Role : Unique categorical values = 29
Country : Unique categorical values = 12
State : Unique categorical values = 14
Segment : Unique categorical values = 12
Unit : Unique categorical values = 10
I have already looked into multiple references but am not sure about the best approach. What should I do in this situation to get the smallest number of features with the maximum positive impact on my model?

As far as I know, OneHotEncoder is the usual choice for these cases but, as you said, your columns have many unique values. I looked for a solution to the same problem for an earlier project and found the following options:
OneHotEncoder + PCA: I don't think this is quite right, because PCA is designed for continuous variables.
Entity Embeddings: I don't know this approach very well, but you can check it from the link in the title.
BinaryEncoder: I think this is useful when you have a large number of categories, because one-hot encoding would increase the number of dimensions, which in turn increases model complexity. Binary encoding represents the same categories with far fewer columns.
There are some other solutions in the category_encoders library.
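To make the column savings concrete, here is a minimal hand-rolled sketch of the idea behind binary encoding, written in plain Python so the mechanism is visible. The helper binary_encode is hypothetical and is not the category_encoders implementation, which may differ in details such as the starting index and the resulting column count. The cardinalities are the ones listed in the question.

```python
from math import ceil, log2

def binary_encode(values):
    """Give each distinct category an integer code, then spell the code out in bits."""
    categories = sorted(set(values))
    width = max(1, ceil(log2(len(categories))))  # bits needed to cover every code
    code = {cat: i for i, cat in enumerate(categories)}
    return [[int(bit) for bit in format(code[v], f"0{width}b")] for v in values]

# Unique-value counts taken from the question.
cardinalities = {"Job_Role": 29, "Country": 12, "State": 14, "Segment": 12, "Unit": 10}
one_hot_columns = sum(cardinalities.values())                        # one column per category
binary_columns = sum(ceil(log2(n)) for n in cardinalities.values())  # one column per bit

rows = binary_encode(["IN", "US", "DE", "US"])  # toy country column with 3 categories

print(one_hot_columns, binary_columns)
print(rows)
```

Under these assumptions the five columns need 77 one-hot features but only 21 bit columns. The trade-off is that individual bit columns have no meaning on their own, so it is worth comparing both encodings on a validation set.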

Related

identifying categorical variables in a dataset

I have a dataset with 150+ features, and I want to separate them into text, categorical, and numeric features. The categorical and text variables both have the object data type. How do we distinguish between a categorical and a text variable? Is there a threshold value for deciding that a variable is categorical?
There is no clear distinction between categories and text. However, if you want to find out whether a particular feature is categorical, you can do a simple test.
E.g. if you are using pandas, you can call value_counts() or unique() on a feature. If the number of distinct values is comparable to the size of the dataset, it is not a categorical field.
The same applies to numeric features. A numeric-looking feature may also be ordinal, meaning there is a clear ordering, e.g. t-shirt sizes.
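The test described above can be sketched as follows. The function name split_object_columns and the 0.5 cut-off are assumptions made for illustration. The answer gives no threshold, so treat the ratio as a knob to tune per dataset.

```python
import pandas as pd

def split_object_columns(df, max_unique_ratio=0.5):
    """Label each non-numeric column 'categorical' or 'text' by its share of distinct values."""
    labels = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns are handled separately
        ratio = df[col].nunique() / len(df)  # distinct values relative to row count
        labels[col] = "categorical" if ratio <= max_unique_ratio else "text"
    return labels

# Toy data: 'size' repeats a few values, 'review' is unique in every row.
df = pd.DataFrame({
    "size": ["S", "M", "L", "M", "S", "L"],
    "review": ["great fit", "too small", "ok", "love it", "runs large", "fine"],
    "price": [10.0, 12.5, 15.0, 12.5, 10.0, 15.0],
})
labels = split_object_columns(df)
print(labels)
```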

How do I use categorical columns in Deep Learning?

I'm using this crop agriculture dataset. In order to use it for training a neural network, I preprocessed the data using MinMaxScaler, which scales the data to between 0 and 1. But my dataset also contains categorical columns, because of which I got an error during preprocessing. So I tried encoding the categorical columns using OneHotEncoder and LabelEncoder, but I don't understand what to do with the result.
My aim is to predict "Crop_Damage".
How do I proceed?
Link to the dataset -
https://www.kaggle.com/aniketng21600/crop-damage-information-in-india
You have several options.
You may use one-hot encoding and pass your categorical variable to the network as a one-hot vector.
You may take inspiration from NLP. One-hot vectors are sparse and may be really large (depending on the number of unique values of your categorical variable). Please look at techniques such as Word2vec (cat2vec) or GloVe. Both aim to turn a categorical element into a dense, meaningful numeric vector.
Besides these two, Keras offers another way to obtain this numeric vector. It is called the Embedding layer. For example, let's consider that you have a variable Crop damage with these values:
Huge
Medium
Little
First you assign a unique integer to every unique value of your categorical variable.
Huge = 0
Medium = 1
Little = 2
Then you pass the translated categorical values (unique integers) to the Embedding layer. The Embedding layer takes a sequence of unique integers as input and produces a sequence of dense vectors. The values of these vectors are random at first, but during training they are optimized like the regular weights of a neural network. So we can say that during training the neural network builds a vector representation of the categories according to the loss function.
For me the Embedding layer is the easiest way to obtain a good enough vector representation of categorical variables. But you can try one-hot encoding first and see whether the accuracy satisfies you.
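What an embedding layer does at lookup time can be sketched with NumPy alone: it is a table with one row per category, indexed by the integer codes. This is only an illustration of the lookup, not Keras code. In Keras the table would be the weight matrix of keras.layers.Embedding and would be updated during training. The vector size of 4 is an arbitrary assumption.

```python
import numpy as np

categories = ["Huge", "Medium", "Little"]
to_index = {name: i for i, name in enumerate(categories)}  # Huge=0, Medium=1, Little=2

rng = np.random.default_rng(0)
embedding_dim = 4  # assumed size of each dense vector
# Random starting values, like the weights of an untrained embedding layer.
embedding_table = rng.normal(size=(len(categories), embedding_dim))

batch = ["Medium", "Huge", "Medium"]
indices = np.array([to_index[v] for v in batch])  # categories -> integers
dense = embedding_table[indices]                  # integers -> dense vectors

print(indices)
print(dense.shape)
```

Rows with the same category always map to the same vector, and training would move those vectors so that categories with a similar effect on the loss end up close together.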
Here is a one-hot encoder. df is the data frame you are working with, and column is the name of the column you want to encode. prefix is a string that is prepended to the names of the columns created by pandas get_dummies. The new dummy columns are created and appended to the data frame, and the original column is then deleted.
There is an excellent series of videos on encoding data frames and other topics on YouTube here.
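import pandas as pd

def onehot_encode(df, column, prefix):
    df = df.copy()  # leave the caller's data frame untouched
    dummies = pd.get_dummies(df[column], prefix=prefix)  # one dummy column per category
    df = pd.concat([df, dummies], axis=1)  # append the dummy columns
    df = df.drop(column, axis=1)  # remove the original column
    return df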
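For the original problem (MinMaxScaler failing on categorical columns), one way to combine scaling and one-hot encoding is scikit-learn's ColumnTransformer, which applies each step only to the columns you name. This is a sketch under assumptions: the column names Insects_Count and Soil_Type and the four rows are hypothetical stand-ins, not taken from the Kaggle file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical miniature of the data; the real column names may differ.
df = pd.DataFrame({
    "Insects_Count": [188, 209, 257, 342],
    "Soil_Type": ["clay", "sand", "clay", "loam"],
    "Crop_Damage": [0, 1, 1, 2],
})
X_raw = df.drop(columns="Crop_Damage")  # features
y = df["Crop_Damage"]                   # target, left unscaled

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["Insects_Count"]),                      # scale numeric columns to [0, 1]
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Soil_Type"]),  # one column per category
])
X = preprocess.fit_transform(X_raw)
X = X.toarray() if hasattr(X, "toarray") else X  # densify if a sparse matrix came back

print(X.shape)
```

The resulting array X, together with y, is what gets passed to the network's fit call. Fitting the transformer on the training rows only and reusing it on the test rows keeps the scaling consistent.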

How do I handle categorical data where there are different numbers of categories for a data point in training and testing?

I am working on the following Kaggle project: https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
My question is: what if a categorical value appears in the test data but not in the training data, or vice versa? For example, a feature A might take the values [a, b] in the training data but [a, b, c] in the test data, or the other way round. Thanks for your help!
I just want to be able to train and run my neural network properly.
Are you OneHotEncoding (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) your categorical data? You could avoid this problem by one-hot encoding the categorical columns before splitting. This means the model is built with a column for every value. A drawback of this approach is that during training the model would only ever see '0' in the column that represents a value missing from the training set. It may not be the best option, but it could solve the issue you are seeing.
Is the issue caused by the fact that you have a very small dataset, or that you have a column with lots of unique values?
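A minimal sketch of encoding before splitting, using pandas: the two frames are stacked, encoded together, and separated again, so both end up with the same dummy columns. The column name Zone and its values are made up for illustration.

```python
import pandas as pd

# Toy frames: value 'c' appears only in the test data, 'b' only in training.
train = pd.DataFrame({"Zone": ["a", "b", "a"]})
test = pd.DataFrame({"Zone": ["a", "c"]})

# Stack both frames, remembering which rows came from where.
combined = pd.concat([train, test], keys=["train", "test"])
encoded = pd.get_dummies(combined, columns=["Zone"], dtype=int)

# Split back; both parts now share the same columns.
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

print(list(train_enc.columns))
print(train_enc["Zone_c"].tolist())
```

As the answer notes, the Zone_c column is all zeros in the training part, so the model learns nothing about that value. If you prefer scikit-learn, OneHotEncoder(handle_unknown="ignore") fitted on the training data alone is an alternative that encodes unseen test values as all zeros.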

Mixed data type TensorFlow-based random forest regression

As the topic suggests, I would like to create a TensorFlow-based random forest regression in Python for our data set, which contains the following columns:
HotelName (text and categorical), Country (text, categorical), Review (text?), date (continuous or categorical, not sure) and some continuous-valued columns.
My questions are:
What should be the exact categories of the data types mentioned above, and is any mapping/discretization of the features necessary (for example, if there are 10 countries, do we map them to integers 1-10)?
How do we implement the random forest TensorFlow model? I searched on the internet but only found the iris data set random forest example (which has only continuous data). In the Estimator API, one can specify the type of value of each column, but that doesn't work with tensor_forest, right? How should I do the implementation?
Thanks and wishing everyone a happy new year!

Does it make sense to do VectorIndexer after StringIndexer on categorical features?

Say I have a bunch of categorical string columns in my dataframe. Then I apply the transforms below:
1. StringIndexer on the columns
2. VectorAssembler to assemble all the transformed columns into one vector feature column
3. VectorIndexer on the new vector feature column
Question: does step 3 make sense, or is it duplicated effort? I think step 1 already did the indexing.
Yes, it makes sense if you are going to use a Spark tree-based algorithm (RandomForestClassifier or GBTClassifier) and you have high-cardinality features.
E.g. for the Criteo dataset, StringIndexer would convert the values in a categorical column to integers in the range 1 to 65000. It saves this in the column metadata as a NominalAttribute. RandomForestClassifier then reads the metadata and treats the column as a categorical feature.
For tree-based algorithms you have to specify the maxBins parameter, which "must be >= 2 and >= number of categories in any categorical feature."
Too high a maxBins value leads to slow performance. To solve this, use VectorIndexer with .setMaxCategories(64), for example. Only features with at most 64 unique values will then be treated as categorical; the rest are treated as continuous.
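The decision VectorIndexer makes can be illustrated without a Spark cluster. The sketch below is plain Python, not Spark's API or implementation. It assumes the rule that a feature with at most max_categories distinct values counts as categorical and anything above that is treated as continuous. The helper split_by_cardinality and the feature names device and ad_id are hypothetical.

```python
def split_by_cardinality(columns, max_categories):
    """Separate feature names by how many distinct values each column holds."""
    categorical, continuous = [], []
    for name, values in columns.items():
        if len(set(values)) <= max_categories:
            categorical.append(name)   # few distinct values: keep as categorical
        else:
            continuous.append(name)    # too many: treat as continuous
    return categorical, continuous

# Toy columns that have already been turned into numeric indices.
indexed = {
    "device": [0.0, 1.0, 2.0, 1.0, 0.0],
    "ad_id": [float(i) for i in range(100)],
}
categorical, continuous = split_by_cardinality(indexed, max_categories=64)

print(categorical, continuous)
```

With this split, maxBins only has to cover the low-cardinality features, which is why the extra VectorIndexer step is not duplicated effort.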
