Pyspark random forest feature importance mapping after column transformations - apache-spark

I am trying to plot the feature importances of certain tree-based models with column names. I am using PySpark.
Since I had both textual categorical variables and numeric ones, I had to use a pipeline that does something like this -
use string indexer to index string columns
use one hot encoder for all columns
use a vectorassembler to create the feature column containing the feature vector
Some sample code from the docs for steps 1,2,3 -
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categoricalColumns = ["workclass", "education", "marital_status", "occupation",
                      "relationship", "race", "sex", "native_country"]
stages = []  # stages in our Pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    # encoder = OneHotEncoderEstimator(inputCol=categoricalCol + "Index", outputCol=categoricalCol + "classVec")
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()],
                                     outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

numericCols = ["age", "fnlwgt", "education_num", "capital_gain",
               "capital_loss", "hours_per_week"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Create a Pipeline.
pipeline = Pipeline(stages=stages)
# Run the feature transformations.
#  - fit() computes feature statistics as needed.
#  - transform() actually transforms the features.
pipelineModel = pipeline.fit(dataset)
dataset = pipelineModel.transform(dataset)
Finally, train the model.
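For context, a minimal sketch of this training step, assuming a DecisionTreeClassifier and a "label" column (neither is shown above, but the importances below come from a model named dtModel_1), might look like:

from pyspark.ml.classification import DecisionTreeClassifier

# Hypothetical training step on the transformed dataset
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label")
dtModel_1 = dt.fit(dataset)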
After training and evaluation, I can use model.featureImportances to get the feature rankings; however, I don't get the feature/column names, just the feature indices, something like this -
print(dtModel_1.featureImportances)
(38895,[38708,38714,38719,38720,38737,38870,38894],[0.0742343395738,0.169404823667,0.100485791055,0.0105823115814,0.0134236162982,0.194124862158,0.437744255667])
How do I map it back to the original column names and values, so that I can plot them?

Extract the metadata as shown here by user6910411:
from itertools import chain

attrs = sorted(
    (attr["idx"], attr["name"])
    for attr in (
        chain(*dataset.schema["features"].metadata["ml_attr"]["attrs"].values())
    )
)
and combine with the feature importances:
[
    (name, dtModel_1.featureImportances[idx])
    for idx, name in attrs
    if dtModel_1.featureImportances[idx]
]
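If the end goal is a plot, a hedged sketch using matplotlib on top of those (name, importance) pairs:

import matplotlib.pyplot as plt

# keep non-zero importances and sort ascending so the largest bar ends up on top
pairs = sorted(
    [(name, dtModel_1.featureImportances[idx]) for idx, name in attrs
     if dtModel_1.featureImportances[idx]],
    key=lambda x: x[1],
)
names, scores = zip(*pairs)
plt.barh(names, scores)
plt.xlabel("feature importance")
plt.tight_layout()
plt.show()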

The transformed dataset metadata has the required attributes. Here is an easy way to do it -
Create a pandas DataFrame (the feature list is generally not huge, so there are no memory issues in storing it as a pandas DF):
pandasDF = pd.DataFrame(
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["binary"] +
    dataset.schema["features"].metadata["ml_attr"]["attrs"]["numeric"]
).sort_values("idx")
Then create a broadcast dictionary for the mapping; broadcasting is necessary in a distributed environment.
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
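A hedged sketch of applying that mapping to the importances (assuming the trained model is called dtModel_1 and, as in the printout above, featureImportances comes back as a SparseVector):

importances = dtModel_1.featureImportances
feature_dict = feature_dict_broad.value
named_importances = [
    (feature_dict[int(i)], v)
    for i, v in zip(importances.indices, importances.values)  # non-zero entries only
]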

When creating your assembler you used a list of variables (assemblerInputs). The order is preserved in the 'features' variable. So just build a pandas DataFrame:
features_imp_pd = (
    pd.DataFrame(
        dtModel_1.featureImportances.toArray(),
        index=assemblerInputs,
        columns=['importance'])
)
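A possible follow-up, to sort and plot that DataFrame (note, as a caveat, that once one-hot encoded columns are expanded the assembled 'features' vector is usually longer than assemblerInputs, so this index alignment only holds when each assembler input contributes a single slot):

features_imp_pd.sort_values('importance', ascending=False).plot(kind='barh', legend=False)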

Related

How to combine a pipeline for all types of features, for categorical features and numerical features in one ColumnTransformer?

I'm trying to create a pipeline that combines:
Pipeline for all kinds of features, no matter the type (cleaning incorrect data by feature)
Pipeline for categorical features (categorical imputer)
Pipeline for numerical features (numerical imputer)
in a sklearn.compose.ColumnTransformer.
Here is a piece of code showing what I'm trying to do:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

alltypes = Pipeline([
    ('column_name_normalizer', ColumnNameNormalizer()),
    ('column_incorrect_data_cleaner', ColumnIncorrectDataCleaner(some_parameter)),
])
num_pipeline = Pipeline([
    ('imputer', CustomNumImputer(some_parameter)),  # fill in missing values
])
cat_pipeline = Pipeline([
    ("cat", CustomCatImputer(some_parameter))
])
full_pipeline = ColumnTransformer([
    ("alltypes", alltypes, allcolumns),
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat)
])
try:
    X = pd.DataFrame(full_pipeline.fit_transform(X).toarray())
except AttributeError:
    X = pd.DataFrame(full_pipeline.fit_transform(X))
However, in the end I get a dataframe with more features than at the beginning, because the outputs of all the pipelines are concatenated instead of being merged (a UNION-like operation on the same columns).
For instance, I want to do some transformations on all features, then some transformations on categorical features, and some transformations on numerical features, but I want the output dataframe to always keep the same number of columns.
Do you know how I can fix this?
You need to combine ColumnTransformer with the sequential power of Pipeline, e.g.:
cat_num_split = ColumnTransformer([
    ("num", num_pipeline, numfeat),
    ("cat", cat_pipeline, catfeat),
])
full_pipeline = Pipeline([
    ("alltypes", alltypes),
    ("cat_num", cat_num_split),
])
There is a catch here: the alltypes transformer produces a numpy array with no information about which column is which, so the numfeat and catfeat lists in cat_num_split have to rely on your knowledge of the column order (i.e. positional indices) and cannot use the column names.
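A minimal sketch of what that means in practice, with purely illustrative positional indices (assuming the first three columns coming out of alltypes are numeric and the last two are categorical):

cat_num_split = ColumnTransformer([
    ("num", num_pipeline, [0, 1, 2]),  # positional indices, not column names
    ("cat", cat_pipeline, [3, 4]),
])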
An alternative that doesn't run into the feature-name issue is to switch the order:
num_full_pipe = Pipeline([
    ("common", alltypes),
    ("num", num_pipeline),
])
cat_full_pipe = Pipeline([
    ("common", alltypes),
    ("cat", cat_pipeline),
])
full_pipeline = ColumnTransformer([
    ("num", num_full_pipe, numfeat),
    ("cat", cat_full_pipe, catfeat),
])
See also Consistent ColumnTransformer for intersecting lists of columns.

How to get feature importances/feature ranking from summary plot in SHAP without crashing?

I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from "how to extract the most important feature names?" and "How to get feature names of shap_values from TreeExplainer?", specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals= np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns,vals)),columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],ascending=False,inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large; creating the vals variable runs very slowly, about 58 minutes on my laptop, and it uses almost all the RAM on the computer.
After 58 minutes I get an error:
Command terminated by signal 9
which as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(list(zip(X_train.columns,np.abs(shap_values2).mean(0))),columns=['col_name','feature_importance_vals'])
so that vals isn't stored but this change doesn't reduce RAM at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
    """Return a dataframe containing the features sorted by Shap importance

    Parameters
    ----------
    model : The tree-based model
    X : pd.Dataframe
        training set/test set/the whole dataset ... (without the label)

    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
but global_shap_importance returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance so that the features are returned in the same order as summary_plot (beeswarm plot).
How can I get the feature importance ranking into a dataframe?
I pulled this straight from the source code. Confirmed identical to the summary_plot.
def shapley_feature_ranking(shap_values, X):
    # Mean absolute SHAP value per feature, computed once to keep memory use down
    mean_abs = np.mean(np.abs(shap_values), axis=0)
    feature_order = np.argsort(mean_abs)
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": [mean_abs[i] for i in feature_order][::-1],
        }
    )
shapley_feature_ranking(shap_values[0], X)
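If you are working with the newer Explainer API from the question (shap_values2 = explainer(X_train)), the underlying array lives on the Explanation object, so a call along these lines should also work (a hedged sketch):

shapley_feature_ranking(shap_values2.values, X_train)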

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfIdfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    else:
        return 0

df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity's sake, the data has already been labeled as either class '0' or '1' (the labels are in the classification column), but I am training the model to classify new instances that have not been labeled yet.
Then in my train/test split I do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Here stopwords_set is the collection of English stopwords from the nltk library, and ps is a PorterStemmer from nltk (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched for this error, the likely cause seemed to be a mismatch in the number of records for each attribute, but I've found this not to be the case: all the records I am using have values for every column.
Can someone else help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfIdfTransformer applied to the text, but also retain the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what the error is due to, since I don't know what the size of your dataframe should be; I would need more information. On which line is the error thrown?
Regarding the fact that you want to retain the sentiment column: you could apply CountVectorizer and TfIdfTransformer (by the way, you could skip a step and directly apply TfidfVectorizer) only on the text data, and then have another transformer in the pipeline that adds the original sentiment column before you feed the dataframe to the classifier.
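A minimal sketch of that idea, assuming scikit-learn's ColumnTransformer, with TfidfVectorizer replacing the CountVectorizer + TfidfTransformer pair and the sentiment column rescaled rather than dropped (clean_review, msg_train and label_train as defined in the question):

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

preprocess = ColumnTransformer([
    # TfidfVectorizer collapses the CountVectorizer + TfidfTransformer steps
    ('text', TfidfVectorizer(analyzer=clean_review), 'review'),
    # MultinomialNB needs non-negative inputs, so rescale the -1/0/1 sentiment to [0, 1]
    ('sentiment', MinMaxScaler(), ['sentiment']),
])
pipeline1 = Pipeline([
    ('features', preprocess),
    ('classifier', MultinomialNB()),
])
pipeline1.fit(msg_train, label_train)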

PySpark feature selection and interpretability

Is there a way in PySpark to perform feature selection, but preserve or obtain a mapping back to the original feature indices/descriptions?
For example:
1. I have a StringArray column of raw feature strings (col = "rawFeatures").
2. I've converted them to numerical counts using CountVectorizer (col = "features").
3. Then I've run the ChiSqSelector to select the top 1000 features (col = "selectedFeatures").
How do I get the raw feature strings that correspond to those top 1000 features (or even just the corresponding indices of these selected features in the original "features" col from step #2)?
This information can be traced back using the fitted Transformers. With a Pipeline like this one:
from pyspark.ml.feature import *
from pyspark.ml import Pipeline
import numpy as np

data = spark.createDataFrame(
    [(1, ["spark", "foo", "bar"]), (0, ["kafka", "bar", "foo"])],
    ("label", "rawFeatures"))

model = Pipeline(stages=[
    CountVectorizer(inputCol="rawFeatures", outputCol="features"),
    ChiSqSelector(outputCol="selectedFeatures", numTopFeatures=2)
]).fit(data)
you can extract Transformers:
vectorizer, chisq = model.stages
and compare selectedFeatures with vocabulary:
np.array(vectorizer.vocabulary)[chisq.selectedFeatures]
array(['spark', 'kafka'], dtype='<U5')
Unfortunately this combination of Transformers doesn't preserve labels metadata:
features_meta, selected_features_meta = (f.metadata for f in model
.transform(data).select("features", "selectedFeatures")
.schema
.fields)
features_meta
{}
selected_features_meta
{'ml_attr': {'attrs': {'nominal': [{'idx': 0}, {'idx': 1}]}, 'num_attrs': 2}}
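Since the labels metadata is lost on "selectedFeatures", a hedged workaround is to build the name mapping yourself from the fitted stages extracted above:

# Names of the selected features, in the order they appear in "selectedFeatures"
selected_names = [vectorizer.vocabulary[i] for i in chisq.selectedFeatures]
# selected_names can later be zipped with, e.g., the featureImportances of a
# model trained on the "selectedFeatures" column.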

Computing Feature Importance with OneHotEncoded Features

Is it possible to compute feature importance (with Random Forest) in scikit learn when features have been onehotencoded?
Here's an example of how to combine feature names with their importances:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# some example data
X = pd.DataFrame({'feature': ['value1', 'value2', 'value2', 'value1', 'value2']})
y = [1, 0, 0, 1, 1]

# translate rows to dicts
def row_to_dict(X, y=None):
    return X.apply(dict, axis=1)

# define prediction model
ft = FunctionTransformer(row_to_dict, validate=False)
dv = DictVectorizer()
rf = RandomForestClassifier()

# glue steps together
model = make_pipeline(ft, dv, rf)

# train
model.fit(X, y)

# get feature importances
feature_importances = zip(dv.feature_names_, rf.feature_importances_)

# have a look
print(list(feature_importances))
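If you then want one importance per original column (summing the one-hot splits, which DictVectorizer names as "column=value" by default), a hedged aggregation sketch on top of the fitted dv and rf above:

from collections import defaultdict

agg = defaultdict(float)
for name, imp in zip(dv.feature_names_, rf.feature_importances_):
    # strip the "=value" part so all one-hot columns of a feature share one key
    agg[name.split('=')[0]] += imp
print(dict(agg))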
Assuming that you have a pipeline with:
a 'pre' step where you implement the OneHotEncoder (inside a ColumnTransformer),
a 'clf' step where you define the classifier,
and the key of the categorical transformation given as 'cat',
the following function will combine the feature importances of the categorical features.
import numpy as np
import pandas as pd
import imblearn

def compute_feature_importance(model):
    """
    Create feature importance using sklearn's ensemble models model.feature_importances_ property.

    Parameters
    ----------
    model : estimator instance (either sklearn.Pipeline, imblearn.Pipeline or a classifier)
        PRE-FITTED classifier or a PRE-FITTED Pipeline in which the last estimator is a classifier.

    Returns
    -------
    fi_df : Pandas DataFrame with feature_names and feature_importance
    """
    if type(model) == imblearn.pipeline.Pipeline:
        # If the user is using a pipeline model,
        # the importance of the feature is calculated in this if block!
        pre_model = model['pre']       # Pre step of the pipeline
        classifier = model['clf']      # Classifier of the pipeline
        ct = model.named_steps['pre']  # Column transformer of the given pipeline model
        # The following line will get the feature names.
        feature_names = pre_model.get_feature_names_out()
        feature_importance = np.array(classifier.feature_importances_)
        # Create a DataFrame using a dictionary
        data = {'feature_names': feature_names, 'feature_importance': feature_importance}
        fi_df = pd.DataFrame(data)
        # Sort the DataFrame in order of decreasing feature importance
        fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
        if 'cat' in ct.named_transformers_.keys() and hasattr(ct.named_transformers_['cat'], 'feature_names_in_'):
            # We first have to apply the column transform and then sum up the feature
            # importance values of the individual OneHotEncoder columns.
            # Original categorical feature list (before applying OneHotEncoder)
            original_cat_features = ct.named_transformers_['cat'].feature_names_in_.tolist()
            # Categorical feature list after applying OneHotEncoder
            all_cat_list = ct.named_transformers_['cat'].get_feature_names_out(original_cat_features).tolist()
            # For each original categorical feature, find its one-hot encoded columns
            for original_cat_feature in original_cat_features:
                # List of one-hot encoded features corresponding to this original categorical feature
                cat_list = [i for i in all_cat_list if i.startswith(original_cat_feature)]
                # OneHotEncoded columns must be renamed:
                # ct.named_transformers_['cat'].get_feature_names_out(original_cat_features)
                # returns column names missing the "cat__" prefix. Let's fix that easily!
                for i, element in enumerate(cat_list):
                    cat_list[i] = 'cat__' + element
                # Slice fi_df to the rows for the associated one-hot encoded feature names (cat_list)
                # and sum their feature importance values
                cat_sum = fi_df[fi_df['feature_names'].isin(cat_list)]['feature_importance'].sum()
                # Slice fi_df to the remaining rows, i.e. everything but these one-hot encoded features
                fi_df = fi_df[~fi_df['feature_names'].isin(cat_list)]
                # Temporary row holding the original categorical feature
                # and the summed importance of its one-hot encoded columns
                temp_dict = {'feature_names': original_cat_feature, 'feature_importance': cat_sum}
                # Append the temporary row to the dataframe
                # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
                fi_df = pd.concat([fi_df, pd.DataFrame([temp_dict])], ignore_index=True)
            # Sort the DataFrame in order of decreasing feature importance
            fi_df.sort_values(by=['feature_importance'], ascending=False, inplace=True)
        # Strip the transformer prefixes from the feature names
        prefixes = ('num__', 'cat__', 'remainder__', 'scaler__')
        for prefix in prefixes:
            fi_df['feature_names'] = fi_df['feature_names'].apply(lambda x: str(x).replace(prefix, ""))
        return fi_df
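A hedged usage sketch, assuming a pre-fitted imblearn Pipeline whose steps are named 'pre' (a ColumnTransformer containing the 'cat' OneHotEncoder) and 'clf' (e.g. a RandomForestClassifier):

# model = imblearn.pipeline.Pipeline([('pre', preprocessor), ('clf', RandomForestClassifier())])
# model.fit(X_train, y_train)
fi_df = compute_feature_importance(model)
print(fi_df.head(10))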
