Computing Feature Importance with OneHotEncoded Features

Computing Feature Importance with OneHotEncoded Features - scikit-learn

Is it possible to compute feature importance (with Random Forest) in scikit learn when features have been onehotencoded?

Here's an example of how to combine feature names with their importances:
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
# some example data
X = pd.DataFrame({'feature': ['value1', 'value2', 'value2', 'value1', 'value2']})
y = [1, 0, 0, 1, 1]
# translate rows to dicts
def row_to_dict(X, y=None):
return X.apply(dict, axis=1)
# define prediction model
ft = FunctionTransformer(row_to_dict, validate=False)
dv = DictVectorizer()
rf = RandomForestClassifier()
# glue steps together
model = make_pipeline(ft, dv, rf)
# train
model.fit(X, y)
# get feature importances
feature_importances = zip(dv.feature_names_, rf.feature_importances_)
# have a look
print feature_importances

Assuming that you have a pipeline with
a 'pre' step where you implement OneHotEncoder,
a 'clf' step where you define the classifier
the key of the categorical transformation is given as 'cat'
The following function will combine the feature importance of categorical features.
import numpy as np
import pandas as pd
import imblearn
def compute_feature_importance(model):
"""
Create feature importance using sklearn's ensemble models model.feature_importances_ property.
Parameters
----------
model : estimator instance (either sklearn.Pipeline, imblearn.Pipeline or a classifier)
PRE-FITTED classifier or a PRE-FITTED Pipeline in which the last estimator is a classifier.
Returns
-------
fi_df : Pandas DataFrame with feature_names and feature_importance
"""
if type(model) == imblearn.pipeline.Pipeline:
# If the user is using a pipeline model,
# the importance of the feature is calculated in this if block!
pre_model = model['pre'] # Pre step of the pipeline
classifier = model['clf'] # Classifier of the pipeline
ct = model.named_steps['pre'] # Define the column transform for the given pipeline model
# The following line will get the feature names.
feature_names = pre_model.get_feature_names_out()
feature_importance = np.array(classifier.feature_importances_)
# Create a DataFrame using a Dictionary
data = {'feature_names': feature_names, 'feature_importance': feature_importance}
fi_df = pd.DataFrame(data)
# Sort the DataFrame in order decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
if 'cat' in ct.named_transformers_.keys() and hasattr(ct.named_transformers_['cat'], 'feature_names_in_'):
# We first have to apply a column transform and then sum up the feature importance values of individual OneHotEncoder columns.
# Original categorical features list. Categorical features before applying OneHotEncoder
original_cat_features = ct.named_transformers_['cat'].feature_names_in_.tolist()
# Categorical features list after applying OneHotEncoder
all_cat_list = ct.named_transformers_['cat'].get_feature_names_out(original_cat_features).tolist()
# A for loop for original_cat_features to find the one hot encoded features corresponding to each original categorical feature
for original_cat_feature in original_cat_features:
# List of one hot encoded features corresponding to each original categorical feature
cat_list = [i for i in all_cat_list if i.startswith(original_cat_feature)]
# OneHotEncoded columns must be renamed.
# ct.named transformers['cat'].get_feature_names_out(original cat_features) returns column names missing "cat__" in front.
# Let's fix it easily!
for i, element in enumerate(cat_list):
cat_list[i] = 'cat__' + element
# Slice fi_df dataframe to return the only rows for the associated OneHotEncoded features names (cat_list)
# and then sum the feature importance values
cat_sum = fi_df[fi_df['feature_names'].isin(cat_list)]['feature_importance'].sum()
# Slice fi_df dataframe to return the only rows other than categorical features.
# In other words, dataframe with numerical features
fi_df = fi_df[~fi_df['feature_names'].isin(cat_list)]
# Create a temporary dictionary to return the originial categorical feature
# and the summation of OneHotEncoded features importances
temp_dict = {'feature_names': original_cat_feature, 'feature_importance': cat_sum}
# Append the temporary_dict to the dataframe
fi_df = fi_df.append(temp_dict, ignore_index=True)
# Sort the DataFrame in order decreasing feature importance
fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
# Getting rid of the prefixes of the feature names
prefixes = ('num__', 'cat__', 'remainder__', 'scaler__')
for prefix in prefixes:
fi_df['feature_names'] = fi_df['feature_names'].apply(lambda x: str(x).replace(prefix,""))
return fi_df

Related

Sklearn's TfidfTransformer(use_idf=False, norm=None) returns the same output as CountVectorizer()

I am trying to understand the code behind TfidfTransformer(). From sklearn's documentation, I can get the term frequencies by setting use_idf=False. But when I check the code on Github, I noticed that the TfidfTransformer() will return the same value as CountVectorizer() when not using normalization, which is just the count of each term.
The code that is supposed to calculate term frequencies.
def transform(self, x, copy=True):
"""Transform a count matrix to a tf or tf-idf representation.
Parameters
----------
X : sparse matrix of (n_samples, n_features)
A matrix of term/token counts.
copy : bool, default=True
Whether to copy X and operate on the copy or perform in-place
operations.
Returns
-------
vectors : sparse matrix of shape (n_samples, n_features)
Tf-idf-weighted document-term matrix.
"""
X = self._validate_data(
X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy-copy, reset=False
)
if not sp.issparse(X):
X = sp.csr_matrix(X, dtype=np.float64)
if self.sublinear_tf:
np.log(X.data, X.data)
X.data += 1
if self.use_idf:
# idf being a property, the automatic attributes detection
# does not work as usual and we need to specify the attribute not fitted")
# name:
check_is_fitted (self, attributes=["idf_"], msg="idf vector is not fitted")
# *= doesn't work
X = X * self._idf_diag
if self.norm is not None:
X = normalize(X, norm=self.norm, copy=False)
return X
image of code above
To investigate more, I ran both classes and compared the output of both CountVectorizer and TfidfTransformer using the following code and the output is equal.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=(
'headers', 'footers', 'quotes'), subset='train', categories=['sci.electronics', 'rec.autos', 'rec.sport.hockey'])
train_documents = dataset.data
vectorizer = CountVectorizer()
train_documents_mat = vectorizer.fit_transform(train_documents)
tf_vectorizer = TfidfTransformer(use_idf=False, norm=None)
train_documents_mat_2 = tf_vectorizer.fit_transform(train_documents_mat)
equal = np.array_equal(
train_documents_mat.toarray(),
train_documents_mat_2.toarray()
)
print(equal)
I am trying to get the term frequencies for my documents rather than just the count. Any ideas why sklearn implement TF-IDF in this way?

How to get a specific sample from pytorch DataLoader?

In Pytorch, is there any way of loading a specific single sample using the torch.utils.data.DataLoader class? I'd like to do some testing with it.
The tutorial uses
trainloader = torch.utils.data.DataLoader(...)
images, labels = next(iter(trainloader))
to fetch a random batch of samples. Is there are way, using DataLoader, to get a specific sample?
Cheers

Turn off the shuffle in DataLoader
Use batch_size to calculate the batch in which the desired sample you are looking for falls in
Iterate to the desired batch
Code
import torch
import numpy as np
import itertools
X= np.arange(100)
batch_size = 2
dataloader = torch.utils.data.DataLoader(X, batch_size=batch_size, shuffle=False)
sample_at = 5
k = int(np.floor(sample_at/batch_size))
my_sample = next(itertools.islice(dataloader, k, None))
print (my_sample)
Output:
tensor([4, 5])

if you want to get a specific signle sample from your dataset you can
you should check Subset class.(https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset)
something like this:
indices = [0,1,2] # select your indices here as a list
subset = torch.utils.data.Subset(train_set, indices)
trainloader = DataLoader(subset , batch_size = 16 , shuffle =False) #set shuffle to False
for image , label in trainloader:
print(image.size() , '\t' , label.size())
print(image[0], '\t' , label[0]) # index the specific sample
here is a useful link if you want to learn more about the Pytorch data loading utility
(https://pytorch.org/docs/stable/data.html)

Grid search tunung

I am implementing KNN using python and it was working.
Now I get an error:
No module named 'sklearn.grid_search
When I change the package to sklean.model_selection, I get another an error:
'GridSearchCV' object has no attribute 'grid_scores_'
Here is my code:
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
# define the parameter values that should be searched
# for python 2, k_range = range(1, 31)
# instantiate model
knn = KNeighborsClassifier(n_jobs=-1)
k_range = list(range(1, 31))
print(k_range)
# create a parameter grid: map the parameter names to the values that should be searched
# simply a python dictionary
# key: parameter name
# value: list of values that should be searched for that parameter
# single key-value pair for param_grid
param_grid = dict(n_neighbors=k_range)
print(param_grid)
# instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
# fit the grid with data
grid.fit(X, y)
# view the complete results (list of named tuples)
grid.grid_scores_
# examine the first tuple
# we will slice the list and select its elements using dot notation and []
print('Parameters')
print(grid.grid_scores_[0].parameters)
# Array of 10 accuracy scores during 10-fold cv using the parameters
print('')
print('CV Validation Score')
print(grid.grid_scores_[0].cv_validation_scores)
# Mean of the 10 scores
print('')
print('Mean Validation Score')
print(grid.grid_scores_[0].mean_validation_score)
# create a list of the mean scores only
# list comprehension to loop through grid.grid_scores
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]
print(grid_mean_scores)
# plot the results
# this is identical to the one we generated above
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')
# examine the best model
# Single best score achieved across all params (k)
print(grid.best_score_)
# Dictionary containing the parameters (k) used to generate that score
print(grid.best_params_)
# Actual model object fit with those best parameters
# Shows default parameters that
We did not specify:
print(grid.best_estimator_)

Try the below :
from sklearn.model_selection import GridSearchCV
ref link
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html

Pyspark random forest feature importance mapping after column transformations

I am trying to plot the feature importances of certain tree based models with column names. I am using Pyspark.
Since I had textual categorical variables and numeric ones too, I had to use a pipeline method which is something like this -
use string indexer to index string columns
use one hot encoder for all columns
use a vectorassembler to create the feature column containing the feature vector
Some sample code from the docs for steps 1,2,3 -
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer,
VectorAssembler
categoricalColumns = ["workclass", "education", "marital_status",
"occupation", "relationship", "race", "sex", "native_country"]
stages = [] # stages in our Pipeline
for categoricalCol in categoricalColumns:
# Category Indexing with StringIndexer
stringIndexer = StringIndexer(inputCol=categoricalCol,
outputCol=categoricalCol + "Index")
# Use OneHotEncoder to convert categorical variables into binary
SparseVectors
# encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index",
outputCol=categoricalCol + "classVec")
encoder = OneHotEncoderEstimator(inputCols=
[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
# Add stages. These are not run here, but will run all at once later on.
stages += [stringIndexer, encoder]
numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
"capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
# - fit() computes feature statistics as needed.
# - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)
finally train the model
after training and eval, I can use the "model.featureImportances" to get the feature rankings, however I dont get the feature/column names, rather just the feature number, something like this -
print dtModel_1.featureImportances
(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
How do I map it back to the initial column names and the values? So that I can plot ?**

Extract metadata as shown here by user6910411
attrs = sorted(
(attr["idx"], attr["name"])
for attr in (
chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
)
)
and combine with feature importance:
[
(name, dtModel_1.featureImportances[idx])
for idx, name in attrs
if dtModel_1.featureImportances[idx]
]

The transformed dataset metdata has the required attributes.Here is an easy way to do -
create a pandas dataframe (generally feature list will not be huge, so no memory issues in storing a pandas DF)
pandasDF = pd.DataFrame(dataset.schema["features"].metadata["ml_attr"]
["attrs"]["binary"]+dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]).sort_values("idx")
Then create a broadcast dictionary to map. broadcast is necessary in a distributed environment.
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)

When creating your assembler you used a list of variables (assemblerInputs). The order is preserved in 'features' variable. So just do a Pandas DataFrame:
features_imp_pd = (
pd.DataFrame(
dtModel_1.featureImportances.toArray(),
index=assemblerInputs,
columns=['importance'])
)

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.

You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string