How to preprocess string columns of NSL-KDD? - python-3.x

I have the NSL-KDD dataset, which contains some string columns.
I have checked some code on GitHub, and most of it just uses a one-hot encoder to transform them, as follows.
I want to know whether I can transform these string columns into 3 numerical columns rather than 84 one-hot encoded columns. After all, one-hot encoding creates a very wide, sparse feature space, which may hurt training accuracy.
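One way to get exactly 3 numeric columns (a sketch, not from the original thread) is ordinal encoding: NSL-KDD's three string columns (protocol_type, service, flag) can each be mapped to a single integer column with scikit-learn's OrdinalEncoder. Note that this imposes an arbitrary ordering on the categories, which tree-based models tolerate much better than linear models do.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the three NSL-KDD categorical columns
df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "ftp"],
    "flag": ["SF", "SF", "REJ", "S0"],
})

cat_cols = ["protocol_type", "service", "flag"]
enc = OrdinalEncoder()
# Each string column becomes one numeric column: 3 columns, not 84
df[cat_cols] = enc.fit_transform(df[cat_cols])

print(df.shape)  # (4, 3)
```

Whether this beats one-hot encoding depends on the model; for random forests the arbitrary ordering usually matters little, while for distance-based or linear models it can mislead.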

Related

Applying OneHotEncoding on categorical data with missing values

I want to OneHotEncode a pd.DataFrame with missing values. When I try to OneHotEncode, it throws an error about the missing values:
ValueError: Input contains NaN
When I try to use a SimpleImputer first to fix the missing values, it throws an error about the categorical data:
ValueError: Cannot use mean strategy with non-numeric data:
could not convert string to float: 'RH'
I can't apply OneHotEncoding because of the missing values, and I can't apply SimpleImputer because of the categorical data.
Is there a way around this besides dropping columns or rows?
You can use either of the two methods below to eliminate NaN categorical values:
Option 1: Replace the missing values with the most frequent category. For instance, if you have a column where 51% of the values belong to one category, use the code below to fill the missing values with that category:
df['col_name'].fillna('most_frequent_category', inplace=True)
Option 2: If you don't wish to impute missing values with the most frequent category, you can create a new category called 'Other' (or a similar neutral category relevant to your variable):
df['col_name'].fillna('Other', inplace=True)
Both of these methods will impute your missing categorical values, and then you will be able to OneHotEncode them.
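A sketch of Option 1 end-to-end (the column name and values are made up): the most frequent category can be computed with mode() rather than hard-coded, after which one-hot encoding succeeds.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"station": ["RH", "RH", "AB", None, "CD"]})

# Option 1: fill NaN with the most frequent category (mode ignores NaN)
df["station"] = df["station"].fillna(df["station"].mode()[0])

# One-hot encoding no longer raises "Input contains NaN"
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["station"]]).toarray()
print(onehot.shape)  # (5, 3): one column each for AB, CD, RH
```

Note also that SimpleImputer(strategy='most_frequent') works on string columns; the ValueError in the question came from the default strategy='mean', which requires numeric data.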

How to preprocess test data after one hot encoding

I am a bit confused here. I have one-hot encoded my categorical columns, for all those with < 10 unique values (low_cardinality_cols), and dropped the remaining categorical columns for both training and validation data.
Now I aim to apply my model to new data in a test.csv. What would be the best method for preprocessing the test data to match the train/validation format?
My concerns are:
1. test.csv will almost certainly have different cardinality for those columns.
2. If I one-hot encode the test data using the low cardinality columns from training, I get Input contains NaN, even though my train, validation & test columns are all the same number.
Sample one-hot encoding below; this is for a Kaggle competition/intermediate course:
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
# This also saves us the hassle of dropping columns
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
I would advise 2 things:
OneHotEncoder has a parameter handle_unknown, which is "error" by default. It should be set to handle_unknown="ignore" in the case that you mention (categories in testing not known during training).
Use a scikit-learn pipeline that includes your predictor, instead of calling fit_transform and transform yourself and then passing the data to the predictor.
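A sketch of both suggestions combined (the toy data, column list, and model choice are illustrative, not from the original post):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})
y_train = [0, 1, 0]
X_test = pd.DataFrame({"color": ["green", "blue"], "size": [2, 1]})  # "green" unseen

low_cardinality_cols = ["color"]
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), low_cardinality_cols)],
    remainder="passthrough",  # keep the numeric columns as-is
)
model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])
model.fit(X_train, y_train)
preds = model.predict(X_test)  # unseen "green" encodes as all zeros; no error
print(preds.shape)
```

The pipeline guarantees the exact same transformation is applied at fit and predict time, which removes the whole class of train/test column-mismatch bugs the question describes.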
As far as I can see, there are two possible solutions to this; I will illustrate both here and you can pick whichever works for you.
Solution 1
If it is possible for you to get all the possible levels/values of the categorical variable that you are planning to encode, you can pass them as the categories parameter when you perform one-hot encoding. The default value of categories is 'auto', which determines the categories automatically from the training data and will not account for new categories found in the testing data. Passing categories as a list of all possible values solves this problem: even if your testing data has new categories that were not present in the training/validation data, they will all be encoded correctly and you won't get NaNs.
Solution 2
If you are not able to collect all possible categories of a categorical column, you can fit the one-hot encoder the way you have done, and when you transform your test data, handle the NaNs you encounter for a new class with an imputation technique such as SimpleImputer or IterativeImputer before processing further.
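A sketch of Solution 1 (the category list is made up): categories that never appear in training still get their own column because they were enumerated up front.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Enumerate every level the column can take: one list per encoded column
all_colors = [["red", "blue", "green"]]

enc = OneHotEncoder(categories=all_colors)
enc.fit(pd.DataFrame({"color": ["red", "blue"]}))  # "green" absent from training

# "green" still encodes correctly because it was enumerated in advance
out = enc.transform(pd.DataFrame({"color": ["green"]})).toarray()
print(out.shape)  # (1, 3): one column per enumerated category
```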

How to Label encode multiple non-contiguous dataframe columns

I have a pandas dataframe with multiple columns (some of them non-contiguous) which need to be label encoded. From my understanding of the LabelEncoder class, each column needs its own LabelEncoder object. I am using the code below (list_of_string_cols is a list of all the columns which need to be label encoded):
for col in list_of_string_cols:
    labelenc = LabelEncoder()
    train_X[col] = labelenc.fit_transform(train_X[col])
    test_X[col] = labelenc.transform(test_X[col])
Is this the correct way?
Yes, that's correct.
Since LabelEncoder was primarily made to deal with labels, not features, it only accepts a single column at a time.
Up until the current version of scikit-learn (0.19.2), what you are using is the correct way of encoding multiple columns. See this question, which does the same thing:
Label encoding across multiple columns in scikit-learn
From the next version (0.20) onwards, OrdinalEncoder can be used to encode all categorical feature columns at once.
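A sketch of the OrdinalEncoder approach (the toy column names and values are made up): one encoder object is fit on all string columns in a single call, replacing the per-column loop.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_X = pd.DataFrame({"city": ["NY", "LA", "NY"], "tier": ["a", "b", "a"]})
test_X = pd.DataFrame({"city": ["LA"], "tier": ["b"]})

list_of_string_cols = ["city", "tier"]
enc = OrdinalEncoder()  # one encoder handles every column at once
train_X[list_of_string_cols] = enc.fit_transform(train_X[list_of_string_cols])
test_X[list_of_string_cols] = enc.transform(test_X[list_of_string_cols])

print(train_X["city"].tolist())  # categories are sorted: LA=0, NY=1
```

As with the loop version, transform will raise on categories unseen during fit, so train on data that covers all expected values.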

how to preserve number of records in word2vec?

I have 45,000 text records in my dataframe. I wanted to convert those 45,000 records into word vectors so that I can train a classifier on them. I am not tokenizing the sentences; I just split each entry into a list of words.
After training a word2vec model with 300 features, the model's shape showed only 26,000 entries. How can I preserve all of my 45,000 records?
In the classifier model, I need all 45,000 records, so that they match the 45,000 output labels.
If you are splitting each entry into a list of words, that's essentially 'tokenization'.
Word2Vec just learns vectors for each word, not for each text example ('record') – so there's nothing to 'preserve', no vectors for the 45,000 records are ever created. But if there are 26,000 unique words among the records (after applying min_count), you will have 26,000 vectors at the end.
Gensim's Doc2Vec (the 'Paragraph Vector' algorithm) can create a vector for each text example, so you may want to try that.
If you only have word-vectors, one simplistic way to create a vector for a larger text is to just add all the individual word vectors together. Further options include choosing between using the unit-normed word-vectors or raw word-vectors of many magnitudes; whether to then unit-norm the sum; and whether to otherwise weight the words by any other importance factor (such as TF/IDF).
Note that unless your documents are very long, this is a quite small training set for either Word2Vec or Doc2Vec.
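The sum-of-word-vectors idea above can be sketched without any gensim dependency (the tiny vocabulary and 4-dimensional vectors are stand-ins for a trained model's model.wv lookup):

```python
import numpy as np

# Stand-in for trained word vectors (e.g. model.wv in gensim)
word_vecs = {
    "good": np.array([1.0, 0.0, 0.0, 0.0]),
    "bad":  np.array([0.0, 1.0, 0.0, 0.0]),
    "food": np.array([0.0, 0.0, 1.0, 0.0]),
}
dim = 4

def record_vector(tokens):
    """Average the vectors of in-vocabulary words; zero vector if none."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

records = [["good", "food"], ["bad", "unknownword"], []]
X = np.stack([record_vector(r) for r in records])
print(X.shape)  # (3, 4): one vector per record, all records preserved
```

Every record, even one with no in-vocabulary words, yields a vector, so the row count matches the label count, which is exactly what the question needs.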

guidelines to handle missing categorical feature values in Random Forest Regressor

What is a general guideline for handling missing categorical feature values when using a Random Forest Regressor (or any ensemble learner, for that matter)? I know that scikit-learn has imputation functions (like the mean strategy) for missing numerical values. But how does one handle a missing categorical value, like industry (oil, computer, auto, None) or major (bachelors, masters, doctoral, None)?
Any suggestion is appreciated.
Breiman and Cutler, the inventors of Random Forest, suggest two possible strategies (see http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1):
Random forests has two ways of replacing missing values. The first way is fast. If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j. If the mth variable is categorical, the replacement is the most frequent non-missing value in class j. These replacement values are called fills.
The second way of replacing missing values is computationally more expensive but has given better performance than the first, even with large amounts of missing data. It replaces missing values only in the training set. It begins by doing a rough and inaccurate filling in of the missing values. Then it does a forest run and computes proximities.
Alternatively, leaving your label variable aside for a minute, you could train a classifier on the rows that have non-null values for the categorical variable in question, using all of your other features. Then use this classifier to predict values for that categorical variable in the rows where it is missing. Armed with a more complete data set, you can return to the task of predicting your original label variable.
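A sketch of that model-based imputation (the column names, values, and classifier choice are illustrative, not from the original answer):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "salary":   [50, 60, 55, 80, 52],
    "years":    [2, 5, 3, 10, 2],
    "industry": ["oil", "auto", "oil", None, None],
})

known = df[df["industry"].notna()]      # rows where the category is present
missing = df[df["industry"].isna()]     # rows to impute

features = ["salary", "years"]  # every feature except the label and the gap column
clf = RandomForestClassifier(random_state=0)
clf.fit(known[features], known["industry"])

# Predict the missing categories and write them back
df.loc[missing.index, "industry"] = clf.predict(missing[features])
print(df["industry"].tolist())
```

Once industry is fully populated, it can be one-hot or ordinally encoded and the original regression on the true label can proceed.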
