Is it possible to get back the list in stratifiedKFold? - scikit-learn

I'd like to do something like this:
Skf = sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=True)
ALPHA, BETA = Skf.split(data_X, data_Y)
and then:
for train_index, test_index in ALPHA, BETA
However, this doesn't work. Why not, and how can I get around the problem?
My idea is that I want to use the same split a few times in different parts of my code, and I don't know how to store the split.

Yes, you can. Specify the seed used by the random number generator via the random_state parameter, and you will obtain the same split across different runs:
SEED = 42
Skf = sklearn.model_selection.StratifiedKFold(n_splits=5,
                                              shuffle=True,
                                              random_state=SEED)
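Why the original attempt fails: split() returns a generator of five (train_index, test_index) pairs, so unpacking it into two names (ALPHA, BETA) raises a ValueError. A minimal sketch of two ways to reuse the same folds (the stand-in data is hypothetical):
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in data; in the question these are data_X and data_Y.
data_X = np.random.rand(20, 3)
data_Y = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Option 1: with a fixed random_state, every call to split() regenerates
# the same folds, so you can simply call it again elsewhere in your code.
# Option 2: materialize the generator once and reuse the list anywhere.
folds = list(skf.split(data_X, data_Y))

for train_index, test_index in folds:  # first use
    pass
for train_index, test_index in folds:  # later reuse: identical folds
    pass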

Related

How to get feature importances/feature ranking from summary plot in SHAP without crashing?

I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from how to extract the most important feature names? and How to get feature names of shap_values from TreeExplainer? specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)),
                                  columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large, but creating the vals variable runs very slowly: about 58 minutes on my laptop, and it uses almost all the RAM on the computer.
After 58 minutes I get an error:
Command terminated by signal 9
which as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(
    list(zip(X_train.columns, np.abs(shap_values2).mean(0))),
    columns=['col_name', 'feature_importance_vals'])
so that vals isn't stored but this change doesn't reduce RAM at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
    """Return a dataframe containing the features sorted by Shap importance.

    Parameters
    ----------
    model : The tree-based model
    X : pd.Dataframe
        training set/test set/the whole dataset ... (without the label)

    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
but global_shap_importance returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance so that the features are returned in the same order as summary_plot (beeswarm plot).
How can I get the feature importance ranking into a dataframe?
I pulled this straight from the source code. Confirmed identical to the summary_plot.
def shapley_feature_ranking(shap_values, X):
    feature_order = np.argsort(np.mean(np.abs(shap_values), axis=0))
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": [
                np.mean(np.abs(shap_values), axis=0)[i] for i in feature_order
            ][::-1],
        }
    )

shapley_feature_ranking(shap_values[0], X)
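If you are starting from the Explanation object created in the question, a usage sketch (assuming shap_values2 holds a single-output explanation, so its .values attribute is a 2-D (n_samples, n_features) array):
# .values is the raw numpy array inside the Explanation, which is
# exactly the shape shapley_feature_ranking expects.
ranking = shapley_feature_ranking(shap_values2.values, X_train)
print(ranking.head(10))  # top features, same order as summary_plot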

How to get a specific sample from pytorch DataLoader?

In PyTorch, is there any way of loading a specific single sample using the torch.utils.data.DataLoader class? I'd like to do some testing with it.
The tutorial uses
trainloader = torch.utils.data.DataLoader(...)
images, labels = next(iter(trainloader))
to fetch a random batch of samples. Is there a way, using DataLoader, to get a specific sample?
Cheers
1. Turn off shuffling in the DataLoader.
2. Use batch_size to calculate which batch the desired sample falls in.
3. Iterate to that batch.
Code
import torch
import numpy as np
import itertools

X = np.arange(100)
batch_size = 2
dataloader = torch.utils.data.DataLoader(X, batch_size=batch_size, shuffle=False)

sample_at = 5
k = int(np.floor(sample_at / batch_size))
my_sample = next(itertools.islice(dataloader, k, None))
print(my_sample)
Output:
tensor([4, 5])
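To pull the single sample out of the returned batch, index by its offset within the batch; a small follow-up sketch using the names above:
single = my_sample[sample_at % batch_size]  # offset 5 % 2 = 1 inside the batch
print(single)  # tensor(5)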
If you want to get a specific single sample from your dataset, you should check out the Subset class (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset).
Something like this:
indices = [0, 1, 2]  # select your indices here as a list
subset = torch.utils.data.Subset(train_set, indices)
trainloader = DataLoader(subset, batch_size=16, shuffle=False)  # set shuffle to False

for image, label in trainloader:
    print(image.size(), '\t', label.size())
    print(image[0], '\t', label[0])  # index the specific sample
Here is a useful link if you want to learn more about the PyTorch data loading utility: https://pytorch.org/docs/stable/data.html

ValueError: unpack: when trying to split fashion_mnist into 3 splits

(train_dataset,validation_dataset,test_dataset) = tfds.load('fashion_mnist',
with_info=True, as_supervised=True,
split=['train[:80%]', 'train[80%:90%]', 'train[90%:]'])
I am trying to split fashion_mnist into three sets: train, validation and test. I'm not sure what the error is here, as I am simply not able to resolve it.
The ValueError comes from the unpacking rather than the slicing. With with_info=True, tfds.load returns a tuple (datasets, info), so the call above yields two values (the list of three datasets, plus the info object) while the left-hand side expects three. Note that "fashion_mnist" itself only provides a train and a test split in TensorFlow Datasets (see the documentation, Splits section). In order to get a train, validation and test split, you could do the following:
import tensorflow as tf
import tensorflow_datasets as tfds

whole_ds, info_ds = tfds.load("fashion_mnist", with_info=True, split='train+test', as_supervised=True)
n = tf.data.experimental.cardinality(whole_ds).numpy()  # 70000
train_num = int(n * 0.8)
val_num = int(n * 0.1)
train_ds = whole_ds.take(train_num)
val_ds = whole_ds.skip(train_num).take(val_num)
test_ds = whole_ds.skip(train_num + val_num)
If you want to retain the provided test data as your test data:
(train_data, validation_data, test_data), info = tfds.load(
    name="fashion_mnist",
    split=['train[:80%]', 'train[80%:]', 'test'],
    as_supervised=True,
    with_info=True)
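A quick sanity check on the resulting split sizes (a sketch assuming the variables above; fashion_mnist ships 60,000 train and 10,000 test examples):
# Expected cardinalities: 48000 train, 12000 validation, 10000 test.
for name, ds in [("train", train_data), ("val", validation_data), ("test", test_data)]:
    print(name, tf.data.experimental.cardinality(ds).numpy())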

for-loop issue (how to use i-parameter on the obj name created within the for-loop)

I'd like to create a few random forest models within a for loop, varying the number of estimators; train each of them on the same data sample and measure the accuracy of each.
This is my beginning code:
r = range(0, 100)
for i in r:
    RF_model_%i = RandomForestClassifier(criterion="entropy", n_estimators=i, oob_score=True)
    RF_model_%i.fit(X_train, y_train)
    y_predict = RF_model_%i.predict(X_test)
    accuracy_%i = accuracy_score(y_test, y_predict)
What I'd like to understand is:
1. How can I put the i parameter in the name of each model (in order to recognise them)?
2. After training and computing the accuracy score of each of the i models, how can I put the results in a list within the for loop?
You can do something like this. RF_model_%i is not valid Python syntax, so instead of building variable names dynamically, keep the models in a dict keyed by i:
results = []  # init
models = {}
for i in range(1, 100):  # n_estimators must be at least 1
    model = RandomForestClassifier(criterion="entropy", n_estimators=i, oob_score=True)
    model.id = i  # dynamically add fields to objects
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    models[i] = model
    results.append(accuracy_score(y_test, y_predict))  # put the result on a list within the for-loop
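A possible follow-up, assuming the names above: pick out the best-performing model once the loop finishes.
# results[k] holds the accuracy of the model trained with n_estimators = k + 1.
best_k = max(range(len(results)), key=lambda k: results[k])
best_model = models[best_k + 1]
print("best n_estimators:", best_model.id, "accuracy:", results[best_k])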

python feature selection in pipeline: how determine feature names?

I used Pipeline and grid search to select the best parameters and then used these parameters to fit the best pipeline ('best_pipe'). However, since the feature selection (SelectKBest) happens inside the pipeline, no fit has been applied to SelectKBest on its own.
I need to know the feature names of the 'k' selected features. Any ideas how to retrieve them? Thank you in advance.
from sklearn import (cross_validation, feature_selection, pipeline,
                     preprocessing, linear_model, grid_search)

folds = 5
split = cross_validation.StratifiedKFold(target, n_folds=folds, shuffle=False, random_state=0)

scores = []
for k, (train, test) in enumerate(split):
    X_train, X_test, y_train, y_test = X.ix[train], X.ix[test], y.ix[train], y.ix[test]
    top_feat = feature_selection.SelectKBest()
    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                              ('feat', top_feat),
                              ('clf', linear_model.LogisticRegression())])
    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']
    param_grid = [{'feat__k': K,
                   'clf__C': C,
                   'clf__penalty': penalty}]
    scoring = 'precision'
    gs = grid_search.GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)
    best_score = gs.best_score_
    scores.append(best_score)
    print "Fold: {} {} {:.4f}".format(k + 1, scoring, best_score)
    print gs.best_params_

best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                               ('feat', feature_selection.SelectKBest(k=80)),
                               ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])
best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)
You can access the feature selector by name in best_pipe:
features = best_pipe.named_steps['feat']
Then you can call transform() on an index array to get the names of the selected columns:
X.columns[features.transform(np.arange(len(X.columns)))]
The output here will be the eighty column names selected in the pipeline.
Jake's answer totally works. But depending on which feature selector you're using, there's another option that I think is cleaner. This one worked for me:
X.columns[features.get_support()]
It gave me an identical answer to Jake's. You can see more about it in the docs, but get_support returns an array of true/false values indicating whether or not each column was selected. Also, it's worth noting that X must be of identical shape to the training data used on the feature selector.
This could be an instructive alternative: I encountered a similar need to what the OP asked. If you want to get the indices of the k best features directly from GridSearchCV:
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
And via index manipulation, you can get your finalFeatureList:
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
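For what it's worth, recent scikit-learn versions (1.0 and later) also expose get_feature_names_out() on selectors, which sidesteps the index bookkeeping; a sketch assuming a fitted pipeline like best_pipe above:
# Returns the names of the k selected columns directly.
selector = best_pipe.named_steps['feat']
selected_names = selector.get_feature_names_out(input_features=X.columns)
print(selected_names)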
