Question about Permutation Importance on LSTM Keras - keras

from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn import PermutationImportance
model = Sequential()
model.add(LSTM(units=30,return_sequences= True, input_shape=(X.shape[1],421)))
model.add(Dropout(rate=0.2))
model.add(LSTM(units=30, return_sequences=True))
model.add(LSTM(units=30))
model.add(Dense(units=1, activation='relu'))
perm = PermutationImportance(model, scoring='accuracy',random_state=1).fit(X, y, epochs=500, batch_size=8)
eli5.show_weights(perm, feature_names = X.columns.tolist())
I am running an LSTM just to see the feature importance of my dataset containing 400+ features. I used the Keras scikit-learn wrapper to use eli5's PermutationImportance function. But the code is returning
ValueError: Found array with dim 3. Estimator expected <= 2.
The code runs smoothly if I use model.fit() but can't debug the error of the permutation importance. Anyone know what is wrong?

eli5's scikit-learn implementation of permutation importance can only process 2D arrays, while Keras LSTM layers require 3D arrays. This error is a known issue, and there appears to be no solution yet.
I understand this does not really answer your question of getting eli5 to work with LSTM (because it currently can't), but I encountered the same problem and used another library called SHAP to get the feature importance of my LSTM model. Here is some of my code to help you get started:
import shap
DE = shap.DeepExplainer(model, X_train) # X_train is 3d numpy.ndarray
shap_values = DE.shap_values(X_validate_np, check_additivity=False) # X_validate is 3d numpy.ndarray
shap.initjs()
shap.summary_plot(
    shap_values[0],
    X_validate,
    feature_names=list_of_your_columns_here,
    max_display=50,
    plot_type='bar')
Here is an example of the graph which you can get: (image not included)
Hope this helps.
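Since the error comes from the 2D check and not from the idea of permutation importance itself, the same quantity can also be computed by hand on the 3D array. Below is a minimal sketch, not eli5's implementation: `permutation_importance_3d`, `toy_predict` and `neg_mse` are hypothetical names, and `predict` and `score` stand in for `model.predict` and whatever metric you use. It shuffles one feature's whole time series between samples and records how much the score drops.

```python
import numpy as np

def permutation_importance_3d(predict, X, y, score, n_repeats=5, seed=0):
    """Mean score drop when one feature is shuffled across samples.

    X has shape (samples, timesteps, features). `predict` and `score`
    are caller-supplied callables (assumed stand-ins for model.predict
    and a metric where higher is better).
    """
    rng = np.random.default_rng(seed)
    baseline = score(y, predict(X))
    importances = np.zeros(X.shape[2])
    for f in range(X.shape[2]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Swap whole time series of feature f between samples,
            # keeping the order inside each series intact.
            shuffled = X[rng.permutation(X.shape[0])]
            X_perm[:, :, f] = shuffled[:, :, f]
            drops.append(baseline - score(y, predict(X_perm)))
        importances[f] = np.mean(drops)
    return importances

# Toy check: the target depends only on feature 0.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4, 3))
y_demo = X_demo[:, :, 0].sum(axis=1)

def toy_predict(X):
    # Hypothetical "model" that only looks at feature 0.
    return X[:, :, 0].sum(axis=1)

def neg_mse(y_true, y_pred):
    return -np.mean((y_true - y_pred) ** 2)

demo_importances = permutation_importance_3d(toy_predict, X_demo, y_demo, neg_mse)
print(demo_importances)
```

With 400+ features and a real LSTM this means many predict calls, so you may want a small n_repeats and a validation subset.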

Related

oversampling (SMOTE) does not work properly when fitted inside a pipeline

I have an imbalanced classification problem and I am using make_pipeline from imblearn
So the steps are the following:
kf = StratifiedKFold(n_splits=10, random_state=42, shuffle=True)
params = {
    'max_depth': [2,3,5],
    # 'max_features':['auto', 'sqrt', 'log2'],
    # 'min_samples_leaf': [5,10,20,50,100,200,300],
    'n_estimators': [10,25,30,50]
    # 'bootstrap': [True, False]
}
from imblearn.pipeline import make_pipeline
imba_pipeline = make_pipeline(SMOTE(random_state = 42), RobustScaler(), RandomForestClassifier(random_state=42))
imba_pipeline
out:Pipeline(steps=[('smote', SMOTE(random_state=42)),
('robustscaler', RobustScaler()),
('randomforestclassifier',
RandomForestClassifier(random_state=42))])
new_params = {'randomforestclassifier__' + key: params[key] for key in params}
grid_imba = GridSearchCV(imba_pipeline, param_grid=new_params, cv=kf, scoring='recall',
                         return_train_score=True, n_jobs=-1, verbose=2)
grid_imba.fit(X_train, y_train)
Everything runs fine and I reach the end of my problem (i.e. I can see the classification report).
However, when I try to look inside the black box with eli5, using eli5.explain_weights(imba_pipeline),
I get the following error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'SMOTE(random_state=42)' (type <class 'imblearn.over_sampling._smote.SMOTE'>) doesn't
I know that this is a common problem and I have read the related questions, but I am confused because the problem occurs after the end of my classification procedure.
Any suggestions?
Your pipeline has two fitted steps (plus the scaler): the SMOTE oversampling and the random forest. This seems to confuse eli5, which appears to assume that only the last step is fitted. To get the weight explanation of the random forest, you could try calling eli5 on only that step of the pipeline:
from eli5 import explain_weights
explain_weights(imba_pipeline['randomforestclassifier'])
This works provided the pipeline is fitted. In your code, however, you fitted the grid search, so
explain_weights(grid_imba.best_estimator_['randomforestclassifier'])
would be more appropriate.
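To make the "fitted grid search, then last step" point concrete, here is a small self-contained sketch. It assumes scikit-learn is installed and, to stay runnable without imblearn or eli5, uses a toy dataset and a pipeline without SMOTE; those are my assumptions, not the asker's setup. It only shows how to reach the fitted forest inside `best_estimator_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Toy data standing in for X_train, y_train (assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

pipe = make_pipeline(RobustScaler(), RandomForestClassifier(random_state=42))
grid = GridSearchCV(pipe, {"randomforestclassifier__max_depth": [2, 3]}, cv=3)
grid.fit(X_toy, y_toy)

# `pipe` itself is still unfitted; the fitted copy lives in best_estimator_.
forest = grid.best_estimator_.named_steps["randomforestclassifier"]
print(forest.feature_importances_)
```

The object named `forest` here is what you would hand to explain_weights.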
Just wanted to point out that SMOTE generally doesn't improve prediction quality. See https://arxiv.org/abs/2201.08528

model.predict in keras using universal sentence encoder giving shape error

I am using Keras model.predict to predict sentiments, with Universal Sentence Encoder embeddings. While predicting, I get the error shown below.
Please provide your valuable insights.
Regards.
I have run the code on two sets of inputs. For input 1, the prediction is obtained, while it does not work for input 2.
Input 1 has the form: {(a1,[sents1]),....}
Input 2: {((a1,a2),[sents11])),...}
The input for predicting is [sents1], [sents11], etc., extracted from these.
I have seen the related question (Keras model.predict function giving input shape error), but I don't know whether it was resolved. Moreover, input 1 works.
import tensorflow as tf
import keras.backend as K
from keras import layers
from keras.models import Model
import numpy as np
def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
input_text = layers.Input(shape=(1,), dtype=tf.string)
embedding = layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)
dense = layers.Dense(256, activation='relu')(embedding)
pred = layers.Dense(category_counts, activation='softmax')(dense)
model = Model(inputs=[input_text], outputs=pred)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
sents1=list(input2.items())
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    # model.load_weights(.//)
    for i,ch in enumerate(sents1):
        new_text=ch[1]
        if len(new_text)>1:
            new_text = np.array(new_text, dtype=object)[:, np.newaxis]
            predicts = model.predict(new_text, batch_size=32)
InvalidArgumentError: input must be a vector, got shape: [] [[{{node
lambda_2/module_1_apply_default/tokenize/StringSplit}} =
StringSplit[skip_empty=true,
_device="/job:localhost/replica:0/task:0/device:CPU:0"](lambda_2/module_1_apply_default/RegexReplace_1,
lambda_2/module_1_apply_default/tokenize/Const)]]
Try removing leading and trailing blanks from each sentence:
new_text = [s.strip() for s in new_text]
USE preprocesses sentences by splitting on spaces; leading or trailing spaces create empty lists, which cannot be embedded.
(Hope this answer is not too late)
There could also be missing values among the sentences, i.e. entries without any text. These need to be excluded.
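Both suggestions (strip the blanks, drop entries without text) can be combined in one cleaning pass before calling model.predict. A minimal sketch, where `clean_sentences` is a hypothetical helper name and the sample sentences are made up:

```python
import numpy as np

def clean_sentences(sentences):
    """Strip whitespace and drop entries that are missing or empty."""
    cleaned = []
    for s in sentences:
        if not isinstance(s, str):
            continue  # skip None, NaN and other non-text values
        s = s.strip()
        if s:  # keep only sentences that still contain text
            cleaned.append(s)
    return cleaned

raw = ["  good movie ", "", None, "   ", "bad plot\n"]
cleaned = clean_sentences(raw)

# Same column layout as in the question: one sentence per row.
batch = np.array(cleaned, dtype=object)[:, np.newaxis]
print(cleaned, batch.shape)
```

If a whole item ends up with no usable sentences, it is probably safest to skip the predict call for it.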

keras error when using custom loss

I want to use a simple BiLSTM model with my own custom loss function in Keras.
See below.
model = Sequential()
model.add(Bidirectional(LSTM(128, return_sequences=True), input_shape=(1,8)))
model.add(Bidirectional(LSTM(128)))
model.add(Dense(64, activation='relu'))
model.add(Dense(20, activation='softmax'))
def my_loss_np(y_true, y_pred):
    labels = [np.argmax(y_pred[i]) for i in range(y_pred.shape[1])]
    loss = np.mean(labels)
    return loss
import keras.backend as K
def my_loss(y_true, y_pred):
    loss = K.eval(my_loss_np(K.eval(y_true), K.eval(y_pred)))
    return loss
When I compile this model, I get an error -
model.compile(loss=my_loss, optimizer='adam')
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'dense_95_target' with dtype float and shape [?,?]
[[Node: dense_95_target = Placeholder[dtype=DT_FLOAT, shape=[?,?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
There are several issues with your loss function:
You are using NumPy on tensors. Intuitive as that is, it doesn't work. You need to use tensor operators from the Keras backend, which are very similar.
To that end you are calling K.eval, but at this stage you are still constructing a symbolic computation graph that will later be run in TensorFlow or Theano. The tensors don't have a value to compute per se, so you need to keep everything symbolic; you can't get values out as you do in NumPy.
Even if you fix the problems above, you are using argmax, a non-differentiable operation, which will not work with gradient descent algorithms.
Your model looks like a multi-class classification problem with 20 classes, since your final layer has 20 units with softmax. In this case, the literature uses the categorical cross-entropy loss to train the classifier network.
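The argmax point can be illustrated outside Keras with plain NumPy. This is only a sketch on made-up logits, with hypothetical function names: nudging the logits slightly leaves an argmax-based loss unchanged (so gradient descent gets no signal), while categorical cross-entropy responds to the same nudge.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def argmax_loss(logits):
    # Mirrors the idea of the question's loss: mean of predicted labels.
    return float(np.mean(np.argmax(softmax(logits), axis=1)))

def crossentropy_loss(logits, y_true):
    probs = softmax(logits)
    return float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])

# Small step that pushes each row towards its true class.
nudged = logits + 1e-3 * np.array([[1.0, -1.0, 0.0], [0.0, 1.0, -1.0]])

argmax_change = argmax_loss(nudged) - argmax_loss(logits)
xent_change = crossentropy_loss(nudged, y_true) - crossentropy_loss(logits, y_true)
print(argmax_change, xent_change)
```

In the Keras model itself, passing loss='categorical_crossentropy' to compile avoids the custom function entirely.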

Keras Conv1D: error of dimensions

I am trying to perform classification using a CNN model.
I have 150 classes. My training set has 19470 rows and 1945 columns. It is a matrix that contains 0s and 1s.
import keras
from keras.models import Sequential
from keras.layers import Conv1D
from keras.layers.advanced_activations import LeakyReLU
model = Sequential()
model.add(Conv1D(150,kernel_size=3,input_shape=(19470,1945),activation='linear',padding='same'))
model.add(LeakyReLU(alpha=0.1))
model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.Adam(),metrics=['accuracy'])
model.fit(x_train, y_train)
This raises:
ValueError: Error when checking input: expected conv1d_39_input to have 3 dimensions, but got array with shape (19470, 1945)
Did you check your x_train shape?
According to the error that Keras is raising, you should do: x_train = x_train.reshape(19470, 1945, 1), and set input_shape=(1945, 1) in the first layer, since input_shape does not include the sample dimension.
I don't understand why you are using as many Conv1D filters as you have classes.
I can't give advice on the architecture of your NN, but I think your last layer should be a Dense layer with 150 units and softmax activation. Don't you have 150 classes?
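The reshape can be checked on its own with NumPy before touching the model. A small sketch, using a toy array with fewer rows than the 19470 in the question (an assumption to keep it light):

```python
import numpy as np

# Toy stand-in for the (19470, 1945) matrix of 0s and 1s.
x_train = np.random.randint(0, 2, size=(100, 1945)).astype("float32")

# Add a trailing channel axis: (samples, steps) -> (samples, steps, 1).
x_train_3d = x_train[..., np.newaxis]

# Keras input_shape leaves out the sample axis.
input_shape = x_train_3d.shape[1:]
print(x_train_3d.shape, input_shape)
```

Here x_train[..., np.newaxis] does the same job as the reshape call, without hard-coding the sizes.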

sample_weight parameter shape error in scikit-learn GridSearchCV

Passing the sample_weight parameter to GridSearchCV raises an error due to an incorrect shape. My suspicion is that cross-validation does not split sample_weight along with the dataset.
First part: Using sample_weight as a model parameter works beautifully
Let's consider a simple example, first without GridSearch:
import pandas as pd
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
dataURL = 'https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sinusoidal_data.csv'
x = pd.read_csv(dataURL, usecols=["x"]).x
y = pd.read_csv(dataURL, usecols=["y"]).y
occurrences = pd.read_csv(dataURL, usecols=["Occurrences"]).Occurrences
my_sample_weights = (1 - occurrences/10000)**3
my_sample_weights contains the importance that I assign to each observation in x, y, as the following picture shows. The points of the sinusoidal curve get higher weights than those forming the background noise.
plt.scatter(x, y, c=my_sample_weights>0.9, cmap="cool")
Let's train a neural network, first without using the information contained in my_sample_weights:
def make_model(number_of_hidden_neurons=1):
    model = Sequential()
    model.add(Dense(number_of_hidden_neurons, input_shape=(1,), activation='tanh'))
    model.add(Dense(1, activation='linear'))
    model.compile(optimizer='sgd', loss='mse')
    return model
net_Not_using_sample_weight = make_model(number_of_hidden_neurons=6)
net_Not_using_sample_weight.fit(x,y, epochs=1000)
plt.scatter(x, y, )
plt.scatter(x, net_Not_using_sample_weight.predict(x), c="green")
As the following picture shows, the neural network tries to fit the shape of the sinusoid, but the background noise prevents a good fit.
Now, using the information in my_sample_weights, the quality of the prediction is much better.
Second part: Using sample_weight as a GridSearchCV parameter raises an error
my_Regressor = KerasRegressor(make_model)
validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [500],
                                     },
                         fit_params={'sample_weight': [ my_sample_weights ]},
                         n_jobs=1,
                         )
validator.fit(x, y)
Trying to pass the sample_weights as a parameter gives the following error:
...
ValueError: Found a sample_weight array with shape (1000,) for an input with shape (666, 1). sample_weight cannot be broadcast.
It seems that the sample_weight vector has not been split in a similar manner to the input array.
For what it's worth:
import sklearn
print(sklearn.__version__)
0.18.1
import keras
print(keras.__version__)
2.0.5
The problem is that, by default in this scikit-learn version, GridSearchCV uses 3-fold cross-validation unless explicitly stated otherwise. This means that 2/3 of the data points are used as training data and 1/3 for validation, which fits the error message. The length of the sample_weight array in fit_params (1000) doesn't match the number of examples used for training (666). Adjust the size and the code will run.
my_sample_weights = np.random.uniform(size=666)
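One caveat with a fixed length of 666: with 1000 points and three folds, the training parts are not all the same size, so a single fixed-length vector may not match every fold. What the asker is really after is slicing the weights with the same indices as the inputs. Here is a minimal NumPy sketch of that idea, using contiguous folds and made-up data; it is an illustration of the indexing, not what GridSearchCV does internally.

```python
import numpy as np

n_samples = 1000
x_demo = np.linspace(0.0, 1.0, n_samples).reshape(-1, 1)
weights_demo = np.random.default_rng(0).uniform(size=n_samples)

# Three contiguous folds, similar to an unshuffled 3-fold split.
folds = np.array_split(np.arange(n_samples), 3)
train_sizes = []
for k, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    x_tr = x_demo[train_idx]
    w_tr = weights_demo[train_idx]  # same indices as the inputs
    assert len(x_tr) == len(w_tr)
    train_sizes.append(len(x_tr))
print(train_sizes)
```

The training sizes come out as 666, 667 and 667, which is why the weights have to follow the split instead of being trimmed once.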
We developed PipeGraph, an extension to the scikit-learn Pipeline that allows you to get intermediate data, build graph-like workflows, and in particular, solve this problem (see the examples in the gallery at http://mcasl.github.io/PipeGraph )
