Applying OneHotEncoding on categorical data with missing values - python-3.x

I want to OneHotEncode a pd.DataFrame with missing values. When I try to OneHotEncode it, it throws an error about the missing values:
ValueError: Input contains NaN
When I try to use a SimpleImputer to fix the missing values, it throws an error regarding the categorical data:
ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RH'
I can't apply OneHotEncoding because of the missing values, and I can't apply SimpleImputer because of the categorical data.
Is there a way around this besides dropping columns or rows?

You can use either of the two methods below to eliminate NaN categorical values:
Option 1: Replace the missing values with the most frequent category. For instance, if 51% of a column's values belong to one category, use the code below to fill the missing values with that category:
df['col_name'].fillna('most_frequent_category', inplace=True)
Option 2: If you don't wish to impute the missing values with the most frequent category, you can create a new category called 'Other' (or a similar neutral category relevant to your variable):
df['col_name'].fillna('Other', inplace=True)
Both of these methods will impute your missing categorical values, and you will then be able to OneHotEncode them.
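For illustration, here is a minimal sketch of Option 1 end to end; df and 'col_name' are placeholder names, and the mode is computed rather than hard-coded:
import pandas as pd

# Toy data with one missing value
df = pd.DataFrame({'col_name': ['RH', 'RH', 'LH', None]})
most_frequent = df['col_name'].mode()[0]  # 'RH' in this toy data
df['col_name'].fillna(most_frequent, inplace=True)
# With the NaNs gone, encoding succeeds (pd.get_dummies shown for brevity;
# sklearn's OneHotEncoder works the same once the values are imputed)
encoded = pd.get_dummies(df, columns=['col_name'])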

Related

When I combine two pandas columns with zip into a dict it reduces my samples

I have two columns in pandas: df.lat and df.lon.
Both have a length of 3897 and 556 NaN values.
My goal is to combine both columns and make a dict out of them.
I use the code:
dict(zip(df.lat,df.lon))
This creates a dict, but with one element less than my original columns.
I used len() to confirm this. I cannot figure out why the dict has one element less than my columns, when both columns have the same length.
Another problem is that the dict contains only the raw values, without the keys "lat" and "lon".
Maybe someone here has an idea?
You may get a different length if there are repeated values in df.lat: a dict can't have duplicate keys, so each repeated key keeps only its last value and the earlier entries are dropped.
A more flexible approach is to use the native pandas df.to_dict() method. In this case, the orientation you want is probably 'records'. Full code:
df[['lat', 'lon']].to_dict('records')
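To see both behaviours side by side, here is a small sketch with made-up coordinates:
import pandas as pd

# Made-up coordinates with one duplicated latitude
df = pd.DataFrame({'lat': [52.5, 48.1, 52.5], 'lon': [13.4, 11.6, 9.0]})
# The duplicate key 52.5 collapses to one entry, keeping the last lon
print(len(dict(zip(df.lat, df.lon))))  # 2, not 3
# 'records' keeps every row and labels the values with the column names
print(df[['lat', 'lon']].to_dict('records'))
# [{'lat': 52.5, 'lon': 13.4}, {'lat': 48.1, 'lon': 11.6}, {'lat': 52.5, 'lon': 9.0}]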

How to preprocess test data after one hot encoding

I am a bit confused here. I have one-hot encoded all the categorical columns with < 10 unique values (low_cardinality_cols) and dropped the remaining categorical columns, for both the training and validation data.
Now I aim to apply my model to new data in a test.csv. What would be the best method for pre-processing the test data to match the train/validation format?
My concerns are:
1. Test_data.csv will certainly have different cardinality for those columns
2. If I one-hot encode the test data using the low-cardinality columns from training, I get Input contains NaN, even though my train, valid & test data all have the same number of columns.
Sample one-hot encoding below; this is for a Kaggle competition/intermediate course:
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
# This also saves us the hassle of dropping columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
I would advise two things:
OneHotEncoder has a parameter handle_unknown, which is "error" by default. It should be set to handle_unknown="ignore" in the case you mention (categories in testing not known during training).
Use a scikit-learn pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then handing the data to the predictor; a sketch follows.
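A minimal sketch of that pipeline approach; low_cardinality_cols, X_train, y_train and X_valid are the names from the question, and RandomForestRegressor is an assumed placeholder predictor:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Assumes the remaining object columns were already dropped, as in the question,
# so remainder='passthrough' forwards only the numeric columns
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols)],
    remainder='passthrough')
model = Pipeline([('preprocess', preprocess), ('predictor', RandomForestRegressor())])
model.fit(X_train, y_train)     # fit_transform happens inside the pipeline
preds = model.predict(X_valid)  # transform is applied automatically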
As far as I can think, there are two possible solutions to this. I will illustrate both here and you can pick whichever works for you.
Solution 1
If it is possible for you to get all the possible levels/values of the categorical variable you are planning to encode, you can pass them as the categories parameter when you perform one-hot encoding. The default value of categories is 'auto', which determines the categories automatically from the training data and will not account for new categories found in the testing data. Passing categories as a list of all possible categories solves this problem: even if your testing data has categories that were not present in the training/validation data, they will all be encoded correctly and you won't get NaNs.
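A minimal sketch of this, using a hypothetical 'color' column whose full set of levels is known in advance:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One list of levels per encoded column; 'color' and its levels are made up
all_levels = [['red', 'green', 'blue', 'yellow']]
encoder = OneHotEncoder(categories=all_levels, sparse=False)
train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['yellow', 'red']})  # 'yellow' never seen in training
encoder.fit(train[['color']])
encoded_test = encoder.transform(test[['color']])  # 'yellow' still gets its own column, no NaNs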
Solution 2
If you are not able to collect all possible categories of a categorical column, you can fit the one-hot encoder the way you have done. Then, when you transform your test data, you can handle the NaNs you encounter for unseen classes with an imputation technique such as SimpleImputer or IterativeImputer, and process further.
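For instance, a sketch of that imputation route; OH_cols_test is a hypothetical encoded test frame that came out containing NaNs:
import pandas as pd
from sklearn.impute import SimpleImputer

# Fill NaN cells in the encoded frame with each column's most frequent value
imp = SimpleImputer(strategy='most_frequent')
OH_cols_test = pd.DataFrame(imp.fit_transform(OH_cols_test),
                            columns=OH_cols_test.columns,
                            index=OH_cols_test.index)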

How to Label encode multiple non-contiguous dataframe columns

I have a pandas dataframe with multiple columns (some of them non-contiguous) which need to be label encoded. From my understanding of the LabelEncoder class, I would need a different LabelEncoder object for each column. I am using the code below (list_of_string_cols is a list of all the columns which need to be label encoded):
for col in list_of_string_cols:
    labelenc = LabelEncoder()
    train_X[col] = labelenc.fit_transform(train_X[col])
    test_X[col] = labelenc.transform(test_X[col])
Is this the correct way?
Yes, that's correct.
LabelEncoder was primarily made to deal with labels, not features, so it only accepts a single column at a time.
Up until the current version of scikit-learn (0.19.2), what you are using is the correct way of encoding multiple columns. See this question, which does the same thing:
Label encoding across multiple columns in scikit-learn
From the next version (0.20) onwards, OrdinalEncoder can be used to encode all categorical feature columns at once.
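A sketch of that alternative, reusing the names from the question (requires scikit-learn >= 0.20):
from sklearn.preprocessing import OrdinalEncoder

ordenc = OrdinalEncoder()
# Fit one mapping for all string columns at once, then reuse it on the test set
train_X[list_of_string_cols] = ordenc.fit_transform(train_X[list_of_string_cols])
test_X[list_of_string_cols] = ordenc.transform(test_X[list_of_string_cols])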

Imputation of categorical variables in python/scikit

I have a csv file with 23 columns of categorical string variables, e.g. Gender, Location, skillset, etc.
Several of these columns have missing values. No column is missing more than 20% of its data, so I would like to impute the missing categorical variables.
Is this possible?
I have tried:
from sklearn_pandas import CategoricalImputer
imputer = CategoricalImputer(strategy='most_frequent', axis=1)
imputer.fit(df[["Permission", "Hope"]])
imputer.transform(df)
but I am getting this error:
NameError: name 'categoricalImputer' is not defined
Will I have to hot-encode each of the 23 columns to integers before I can impute?
Or is it possible to impute missing categorical string variables directly?
CategoricalImputer was only introduced in version 0.20, so update with pip install git+git://github.com/scikit-learn/scikit-learn.git or check the GitHub issue: https://github.com/scikit-learn/scikit-learn/issues/10579
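Alternatively, a sketch assuming scikit-learn >= 0.20, where SimpleImputer's 'most_frequent' strategy also works on string columns; df and the column names come from the question:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
# Works directly on string columns, so no integer encoding is needed first
df[["Permission", "Hope"]] = imputer.fit_transform(df[["Permission", "Hope"]])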

How to add NER tags to features

I have a set of training sentences for which I computed some float features. In each sentence, two entities are identified. They are either of type 'PERSON', 'ORGANIZATION', 'LOCATION', or 'OTHER'. I would like to add these types to my feature matrix (which stores float variables).
My question is: is there a recommended way to add these entity types?
I could think of two ways for now:
either adding TWO columns, one for each entity, filled with entity type ids (e.g. 0 to 3 or 1 to 4)
or adding EIGHT columns, one for each entity type and each entity, filled with 0's and 1's
Best!
I would recommend that you use something that can easily be normalized and which is in the same range as the rest of your data.
So if all your float values are between -1 and 1, I would keep the values from your "Named Entity Recognition" in the same range.
So depending on what you prefer, or what gives you the best result, you could either assign four values in the same range as the rest of your floats, or use a binary result with more columns.
Finally, the second suggestion (adding EIGHT columns, one for each entity type and each entity, and filling them with 0's and 1's) worked fine!
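For reference, a minimal sketch of that eight-column encoding; the column names entity1_type and entity2_type are illustrative assumptions:
import pandas as pd

# The four types come from the question; the column names are made up
types = ['PERSON', 'ORGANIZATION', 'LOCATION', 'OTHER']
df = pd.DataFrame({'entity1_type': ['PERSON', 'LOCATION'],
                   'entity2_type': ['OTHER', 'ORGANIZATION']})
onehot = pd.get_dummies(df[['entity1_type', 'entity2_type']])
# Reindex so all eight columns exist even when a type is absent from the data
expected = [f'{c}_{t}' for c in ['entity1_type', 'entity2_type'] for t in types]
onehot = onehot.reindex(columns=expected, fill_value=0)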
