How to preprocess test data after one hot encoding - scikit-learn

I am a bit confused here. I have one-hot encoded the categorical columns with fewer than 10 unique values (low_cardinality_cols), and dropped the remaining categorical columns for both the training and validation data.
Now I want to apply my model to new data in a test.csv. What would be the best way to preprocess the test data so it matches the train/validation format?
My concerns are:
1. Test_data.csv will almost certainly have different cardinality for those columns.
2. If I one-hot encode the test data using the low-cardinality columns from training, I get Input contains NaN, even though my train, validation & test sets all have the same number of columns.
Sample one-hot encoding below; this is for a Kaggle competition/intermediate course.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove all categorical columns (the one-hot columns replace the
# low-cardinality ones; this also drops the high-cardinality ones in one step)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

I would advise two things:
OneHotEncoder has a parameter handle_unknown, which is "error" by default. It should be set to handle_unknown="ignore" in the case you mention (categories present in the test data that were not seen during training).
Use a scikit-learn Pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then handing the data to the predictor, as in the sketch below.
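For illustration, a minimal sketch of that advice, assuming X_train/y_train and low_cardinality_cols from the question and that the high-cardinality object columns were already dropped; the RandomForestRegressor is an illustrative choice, not part of the original post:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# Encode the low-cardinality columns inside the pipeline and pass the
# numerical columns through unchanged
preprocessor = ColumnTransformer(
    transformers=[('onehot',
                   OneHotEncoder(handle_unknown='ignore'),
                   low_cardinality_cols)],
    remainder='passthrough')
model = Pipeline(steps=[('preprocess', preprocessor),
                        ('regressor', RandomForestRegressor(random_state=0))])
model.fit(X_train, y_train)    # the encoder is fitted once, on training data only
preds = model.predict(X_test)  # unseen test categories become all-zero rows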

As far as I can tell, there are two possible solutions to this; I will illustrate both here and you can pick whichever works for you.
Solution 1
If you can enumerate all the possible levels/values of the categorical variables you plan to encode, pass them as the categories parameter when you perform one-hot encoding. The default value for categories is 'auto', which determines the categories from the training data alone and therefore does not account for new categories found in the test data. Supplying categories as a list of all possible values solves this problem: even if your test data has categories that were not present in the training/validation data, they will all be encoded correctly and you won't get NaNs.
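For illustration, a minimal sketch of Solution 1; the column name 'Condition' and its level list are made-up placeholders:
from sklearn.preprocessing import OneHotEncoder
# Enumerate every level the column can take, including values that may
# only appear in the test data (one list per encoded column)
all_levels = [['Excellent', 'Good', 'Fair', 'Poor']]
encoder = OneHotEncoder(categories=all_levels, sparse=False)
encoded_train = encoder.fit_transform(X_train[['Condition']])
encoded_test = encoder.transform(X_test[['Condition']])  # same width, no NaNs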
Solution 2
If you cannot collect all possible categories of a categorical column, fit the one-hot encoder the way you have done. Then, when transforming your test data, handle the NaNs you encounter for new classes with an imputation technique such as SimpleImputer or IterativeImputer, and process further.
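A minimal sketch of Solution 2, assuming an encoded test frame OH_X_test built the same way as OH_X_valid above, with NaNs left over:
import pandas as pd
from sklearn.impute import SimpleImputer
# Impute leftover NaNs in the encoded test frame before predicting
imputer = SimpleImputer(strategy='most_frequent')
OH_X_test = pd.DataFrame(imputer.fit_transform(OH_X_test),
                         columns=OH_X_test.columns,
                         index=OH_X_test.index)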

Related

How to preprocess string columns of NSL-KDD?

I have the NSL-KDD dataset, which has some string columns.
I have checked some code from GitHub, and most of it just uses a one-hot encoder to transform them.
I want to know whether I can transform these string columns into 3 numerical columns rather than 84 one-hot encoded columns. After all, they create too wide a sparse feature space, which may hurt training accuracy.
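A hedged sketch of the 3-column alternative the question asks about, using OrdinalEncoder; the column names are assumptions about the NSL-KDD schema:
from sklearn.preprocessing import OrdinalEncoder
# Assumed names of the three NSL-KDD string columns; adjust to the actual schema
string_cols = ['protocol_type', 'service', 'flag']
encoder = OrdinalEncoder()
df[string_cols] = encoder.fit_transform(df[string_cols])  # 3 integer columns
Note that ordinal codes impose an arbitrary order on the categories, which tree-based models tolerate better than linear models or SVMs.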

Applying OneHotEncoding on categorical data with missing values

I want to OneHotEncode a pd.DataFrame with missing values. When I try to OneHotEncode, it throws an error regarding missing values.
ValueError: Input contains NaN
When I try to use a SimpleImputer to fix missing values, it throws an error regarding categorical data
ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RH'
I can't apply OneHotEncoding because of missing values and SimpleImputer because of categorical data.
Is there a way around this besides dropping columns or rows?
You can use either of the two methods below to eliminate NaN categorical values:
Option 1: Replace the missing values with the most frequent category. For instance, if you have a column with 51% of values belonging to one category, use the code below to fill missing values with that category:
df['col_name'] = df['col_name'].fillna('most_frequent_category')
Option 2: If you don't wish to impute missing values with the most frequent category, you can create a new category called 'Other' (or a similar neutral category relevant to your variable):
df['col_name'] = df['col_name'].fillna('Other')
Both these methods will impute your missing categorical values and then you will be able to OneHotEncode them.
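A hedged sketch of the same idea done with scikit-learn objects, so the fill value and the encoder travel together; cat_cols is a placeholder for your categorical column names:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# Fill NaNs with a constant 'Other' category, then one-hot encode
cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant', fill_value='Other')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
encoded = cat_pipeline.fit_transform(df[cat_cols])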

How to Label encode multiple non-contiguous dataframe columns

I have a pandas dataframe with multiple columns (some of them non-contiguous) that need to be label encoded. From my understanding of the LabelEncoder class, I would need a different LabelEncoder object for each column. I am using the code below (list_of_string_cols is a list of all the columns that need to be label encoded):
from sklearn.preprocessing import LabelEncoder
# Fit a separate encoder per column, then reuse it on the test set
for col in list_of_string_cols:
    labelenc = LabelEncoder()
    train_X[col] = labelenc.fit_transform(train_X[col])
    test_X[col] = labelenc.transform(test_X[col])
Is this the correct way?
Yes, that's correct.
Since LabelEncoder was primarily made to deal with labels rather than features, it only handles a single column at a time.
Up until the current version of scikit-learn (0.19.2), what you are using is the correct way of encoding multiple columns. See this question, which does the same thing:
Label encoding across multiple columns in scikit-learn
From next version onwards (0.20), OrdinalEncoder can be used to encode all categorical feature columns at once.
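For illustration, a minimal sketch of the OrdinalEncoder equivalent (0.20+), which encodes all the columns with a single fitted object:
from sklearn.preprocessing import OrdinalEncoder
# One encoder handles every column at once
ordenc = OrdinalEncoder()
train_X[list_of_string_cols] = ordenc.fit_transform(train_X[list_of_string_cols])
test_X[list_of_string_cols] = ordenc.transform(test_X[list_of_string_cols])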

guidelines to handle missing categorical feature values in Random Forest Regressor

What is a general guideline for handling missing categorical feature values when using a Random Forest Regressor (or any ensemble learner, for that matter)? I know that scikit-learn has imputation functions (like a mean strategy or proximity-based filling) to impute missing numerical values. But how does one handle a missing categorical value, like Industry (oil, computer, auto, None) or major (bachelors, masters, doctoral, None)?
Any suggestion is appreciated.
Breiman and Cutler, the inventors of Random Forest, suggest two possible strategies (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1):
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
Alternatively, leaving your label variable aside for a minute, you could train a classifier on rows that have non-null values for the categorical variable in question, using all of your features in the classifier. Then use this classifier to predict values for the categorical variable in question in your 'test set'. Armed with a more complete data set, you can now return to the task of predicting values for your original label variable.
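A hedged sketch of that alternative, with made-up column names ('Industry' as the categorical feature to fill, 'label' as the original target) and assuming the remaining features are already numeric; RandomForestClassifier is one reasonable choice of classifier:
from sklearn.ensemble import RandomForestClassifier
# Split rows by whether the categorical feature is missing
known = df[df['Industry'].notna()]
unknown = df[df['Industry'].isna()]
# Train on rows where 'Industry' is known, using the other features
feature_cols = [c for c in df.columns if c not in ('Industry', 'label')]
clf = RandomForestClassifier(random_state=0)
clf.fit(known[feature_cols], known['Industry'])
# Predict 'Industry' for the rows where it was missing
df.loc[unknown.index, 'Industry'] = clf.predict(unknown[feature_cols])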

Polynominal error in Rapidminer when doing n-gram classification

I am trying to classify different concepts in a text using n-grams. My data typically consists of six columns:
The word that needs classification
The classification
First word on the left of 1)
Second word on the left of 1)
First word on the right of 1)
Second word on the right of 1)
When I try to use an SVM in RapidMiner, I get the error that it cannot handle polynominal values. I know that this can be done because I have read it in different papers. I set the second column to 'label' and have tried to set the rest to 'text' or 'real', but it seems to have no effect. What am I doing wrong?
You have to use the Support Vector Machine (LibSVM) Operator.
In contrast to the classic SVM, which only supports two-class problems, the LibSVM implementation (http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf) supports multi-class classification as well as regression.
One approach could be to create attributes with names equal to the words and values equal to the distance from the word of interest. Of course, all possible words would need to be represented as attributes, so the input data would be large.
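For illustration, a hedged Python sketch of that representation (in RapidMiner itself this would be built with its GUI operators); the helper and sample words are made up:
# One attribute per vocabulary word, valued by signed distance from the
# word of interest (0 if the word is absent from the window)
def distance_features(window, vocabulary):
    center = len(window) // 2  # the word that needs classification
    features = {word: 0 for word in vocabulary}
    for pos, word in enumerate(window):
        if word in features:
            features[word] = pos - center  # negative = left, positive = right
    return features
print(distance_features(['the', 'quick', 'fox', 'jumps', 'over'],
                        ['quick', 'jumps', 'lazy']))
# {'quick': -1, 'jumps': 1, 'lazy': 0}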
