Imputation of categorical variables in Python / scikit-learn

I have a CSV file with 23 columns of categorical string variables, e.g. Gender, Location, skillset, etc.
Several of these columns have missing values. No column is missing more than 20% of its data, so I would like to impute the missing categorical variables.
Is this possible?
I have tried
from sklearn_pandas import CategoricalImputer
imputer = CategoricalImputer(strategy='most_frequent', axis=1)
imputer.fit(df[["Permission", "Hope"]])
imputer.transform(df)
but I am getting this error:
NameError: name 'categoricalImputer' is not defined
Will I have to one-hot encode each of the 23 columns to integers before I can impute, or is it possible to impute the missing categorical string variables directly?

CategoricalImputer was only introduced in version 0.20, so update with pip install git+git://github.com/scikit-learn/scikit-learn.git, or follow the GitHub issue: https://github.com/scikit-learn/scikit-learn/issues/10579
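Alternatively, on scikit-learn 0.20+ you don't need sklearn_pandas at all: SimpleImputer with strategy='most_frequent' works directly on string columns. A minimal sketch, reusing the column names from the question (df is assumed to be your dataframe):
from sklearn.impute import SimpleImputer
# 'most_frequent' works on string/categorical columns; no integer encoding needed
imputer = SimpleImputer(strategy='most_frequent')
df[["Permission", "Hope"]] = imputer.fit_transform(df[["Permission", "Hope"]])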

Related

Python 3 - Pandas - Missing Data and Binning Values

I have the code below, which pulls data from a CSV file, and I want to perform some analysis on one of the variables in the data set. The variable is an object data type and it has a lot of missing values, so I am using pd.to_numeric to convert them to NaN. Then I create some bins (1, 2, 3, 4, 5) to categorise the values of the variable. However, I would like to include NaN as a category or bin in the new variable ("variable_q"), for example as a "No Data" category. Is that possible? How can I do that? Am I missing anything?
import pandas as pd
import numpy as np
data = pd.read_csv("dataset.csv")
data["variable"] = pd.to_numeric(data["variable"], errors="coerce")
data["variable_q"] = pd.cut(x=data["variable"], bins=[1, 2, 3, 4, 5],
                            labels=["Low", "Moderate", "High", "Extremely High"])
Thank you very much!
Well, I am doing a frequency distribution on the variable, and I just realised that adding dropna=False as an argument in the code below solves the problem.
A NaN category has been created! So there is no need to add a "No Data" category in the code above.
data["variable_q"].value_counts(sort=False,dropna=False)

Applying OneHotEncoding on categorical data with missing values

I want to OneHotEncode a pd.DataFrame with missing values. When I try to OneHotEncode, it throws an error regarding missing values:
ValueError: Input contains NaN
When I try to use a SimpleImputer to fix missing values, it throws an error regarding categorical data
ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RH'
I can't apply OneHotEncoding because of missing values and SimpleImputer because of categorical data.
Is there a way around this besides dropping columns or rows?
You can use either of the two methods below to eliminate NaN categorical values.
Option 1: Replace the missing values with the most frequent category. For instance, if you have a column where 51% of the values belong to one category, then use the code below to fill the missing values with that category:
df['col_name'] = df['col_name'].fillna('most_frequent_category')
Option 2: If you don't wish to impute missing values with the most frequent category, you can create a new category called 'Other' (or a similar neutral category relevant to your variable):
df['col_name'] = df['col_name'].fillna('Other')
Both these methods will impute your missing categorical values and then you will be able to OneHotEncode them.
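For what it's worth, a programmatic version of Option 1 (a sketch; 'col_name' is a placeholder column name) picks the most frequent category with mode() instead of hard-coding it, after which OneHotEncoder no longer sees NaNs:
from sklearn.preprocessing import OneHotEncoder
# fill the column with its own most frequent value
df['col_name'] = df['col_name'].fillna(df['col_name'].mode()[0])
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['col_name']])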

How to preprocess test data after one hot encoding

I am a bit confused here. I have one-hot encoded my categorical columns for all those with < 10 unique values (low_cardinality_cols), and dropped the remaining categorical columns for both training and validation data.
Now I aim to apply my model to new data in a test.csv. What would be the best method for pre-processing the test data to match the train/validation format?
My concerns are:
1. Test_data.csv will certainly have different cardinality for those columns.
2. If I one-hot encode the test data using the low-cardinality columns from training, I get Input contains NaN, even though my train, valid & test columns are all the same number.
Sample one-hot encoding below; this is for the Kaggle competition/intermediate course here
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
# This also saves us the hassle of dropping columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
I would advise two things:
OneHotEncoder has a parameter handle_unknown, which is "error" by default. It should be set to handle_unknown="ignore" in the case that you mention (categories present in testing that were not known during training).
Use a scikit-learn Pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then handing the data to the predictor. A sketch follows.
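A minimal sketch of both points together, assuming the X_train, y_train, X_test and low_cardinality_cols names from the question's code, and a RandomForestRegressor as a placeholder model:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), low_cardinality_cols)],
    remainder='passthrough')  # pass the numeric columns through unchanged
model = Pipeline([('preprocess', preprocess),
                  ('predict', RandomForestRegressor(random_state=0))])
model.fit(X_train, y_train)    # the encoder is fit on training data only
preds = model.predict(X_test)  # test categories unseen in training are encoded as all zeros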
As far as I can think, there are two possible solutions to this; I will illustrate both here and you can pick whichever works for you.
Solution 1
If it is possible for you to get all the possible levels/values of the categorical variable that you are planning to encode, you can pass them as the categories parameter when you perform one-hot encoding. The default value for categories is 'auto', which determines the categories automatically from the training data and will not account for new categories found in the testing data. Enforcing categories as a list of all possible categories solves this problem: even if your testing data has new categories that were not present in the training/validation data, they will all be encoded correctly and you won't get NaNs.
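As a sketch, with made-up category lists for illustration (one list per encoded column, in column order):
from sklearn.preprocessing import OneHotEncoder
all_categories = [['A', 'B', 'C'],   # hypothetical levels of the first column
                  ['X', 'Y', 'Z']]   # hypothetical levels of the second column
OH_encoder = OneHotEncoder(categories=all_categories, sparse=False)
OH_cols_train = OH_encoder.fit_transform(X_train[low_cardinality_cols])
OH_cols_test = OH_encoder.transform(X_test[low_cardinality_cols])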
Solution 2
In case you are not able to collect all possible categories of a categorical column, you can go ahead and fit the one-hot encoder the way you have done. Then, when you transform your test data, you can handle the NaNs you encounter on new classes with an imputation technique such as SimpleImputer or IterativeImputer, and process further.
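A rough sketch of that idea, imputing an already-encoded test frame column-wise with its most frequent value (OH_cols_test is assumed to be built the same way as OH_cols_valid above; whether this is sensible depends on your data):
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='most_frequent')
OH_cols_test = pd.DataFrame(imp.fit_transform(OH_cols_test),
                            index=OH_cols_test.index,
                            columns=OH_cols_test.columns)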

How to perform group by on multiple columns in a pandas dataframe and predict future values using fbProphet in python?

My dataframe looks like the following. I am trying to aggregate (sum) my Amount column based on the Date and Group columns in the pandas dataframe. I was able to successfully aggregate the column; however, I am not sure how to pass the result to fbprophet to predict future values based on the grouped Date and Group. Below is the code for the aggregation.
Note: I am a beginner in Python; please provide an explanation with the code.
Data Frame
import pandas as pd
data = {'Date':['2017-01-01', '2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01','2017-01-01',
'2017-02-01', '2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01','2017-02-01'],'Group':['A','A','B','B','C','C','D','D','A','A','B','B','C','C','D','D'],
'Amount':['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']}
df = pd.DataFrame(data)
Code tried so far:
grouped = df.groupby(['Group','Date'])[['Amount']].sum()
You're suffering from a few problems.
numeric
The 3rd line of your data initialization should pass in float rather than str; elide the quotes.
Or, this will fix it:
'Amount':[float(n) for n in ['12.1','13.2','15.1','10.7','12.9','9.0','5.6','6.7','4.3','2.3','4.0','5.6','7.8','2.3','5.6','8.9']]}
We do this because you really don't want .sum() to put together '12.1' and '13.2' and come up with '12.113.2'; you'd much prefer 25.3.
index
The grouped object you computed looks good on the face of it, but if you inspect the .dtypes attribute you'll see it's only offering the Amount column to Facebook Prophet.
To remedy that, use .reset_index():
>>> grouped.reset_index(inplace=True)
>>> grouped.dtypes
Group object
Date object
Amount float64
dtype: object
But now we see one last fly in the ointment.
dates
Having opaque categories of 'A' or 'B' is fine, but for Date we probably want to know that February or March comes a certain number of days after January, rather than leaving opaque str labels in that column. We might have done the type conversion back when we presented the data input, but it's fine to clean it up at this stage, too:
import datetime as dt

def to_timestamp(day: str):
    return dt.datetime.strptime(day, '%Y-%m-%d')

grouped['Date'] = grouped.Date.apply(to_timestamp)
Having successfully wrangled the shape and types of your data, you should now be in a good position to let libraries further analyze it.
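To close the loop on the fbprophet part of the question, one possible sketch (not a definitive recipe) is to fit a separate Prophet model per Group; Prophet expects a two-column frame named ds and y:
from fbprophet import Prophet
forecasts = {}
for group_name, df_g in grouped.groupby('Group'):
    # rename to the column names Prophet requires: ds (dates) and y (values)
    ts = df_g.rename(columns={'Date': 'ds', 'Amount': 'y'})[['ds', 'y']]
    m = Prophet()
    m.fit(ts)
    future = m.make_future_dataframe(periods=3, freq='MS')  # 3 monthly steps ahead
    forecasts[group_name] = m.predict(future)
With only two months per group in the sample data, Prophet will have very little to work with; in practice you would want a longer history per group.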

How to Label encode multiple non-contiguous dataframe columns

I have a pandas dataframe with multiple columns (some of them non-contiguous) which need to be label encoded. From my understanding of the LabelEncoder class, for each column I would need to use a different LabelEncoder object. I am using the code below (list_of_string_cols in the code below is a list of all the columns which need to be label encoded):
from sklearn.preprocessing import LabelEncoder

for col in list_of_string_cols:
    labelenc = LabelEncoder()
    train_X[col] = labelenc.fit_transform(train_X[col])
    test_X[col] = labelenc.transform(test_X[col])
Is this the correct way?
Yes, that's correct.
Since LabelEncoder was primarily made to deal with labels and not features, it allows only a single column at a time.
Up until the current version of scikit-learn (0.19.2), what you are using is the correct way of encoding multiple columns. See this question, which does the same thing you are doing:
Label encoding across multiple columns in scikit-learn
From the next version (0.20) onwards, OrdinalEncoder can be used to encode all categorical feature columns at once, as sketched below.
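A minimal sketch of the 0.20+ approach, reusing the variable names from the question:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()  # a single encoder handles all the columns at once
train_X[list_of_string_cols] = enc.fit_transform(train_X[list_of_string_cols])
# note: transform raises on categories it did not see during fit
test_X[list_of_string_cols] = enc.transform(test_X[list_of_string_cols])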
