Unable to Reproduce Results while using Scikit-learn RFECV - scikit-learn

I am trying to use Recursive Feature Elimination with cross-validation (RFECV) and get reproducible results. I have tried fixing the randomness by passing random_state=SEED to each component, and also by setting the global seed with np.random.seed(SEED). However, I am still unable to control the randomness or reproduce results. The code segment is below.
estimator = GradientBoostingClassifier(random_state=SEED, n_estimators=2*df.shape[1])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=SEED)
selector = RFECV(estimator, n_jobs=-1,step=STEP, cv=cv)
selector = selector.fit(df, y)
df = df.loc[:, selector.support_]
print("Shape of final data AFTER FEATURE SELECTION")
print(df.shape, y.shape)
Specifically, each time I run this segment of code it selects a different number of features. Any help would be appreciated.

Related

How to resolve KeyError: 'val_mean_absolute_error' Keras 2.3.1 and TensorFlow 2.0 From Chollet Deep Learning with Python

I am on section 3.7 of Chollet's book Deep Learning with Python.
The project is to find the median price of homes in a given Boston suburb in the 1970s.
https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.7-predicting-house-prices.ipynb
At section "Validating our approach using K-fold validation" I try to run this block of code:
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)
I get an error KeyError: 'val_mean_absolute_error'
mae_history = history.history['val_mean_absolute_error']
I am guessing the solution is to figure out the correct key to replace val_mean_absolute_error. I've tried looking through the Keras documentation for the correct key value. Does anyone know the correct key value?
The problem in your code is that, when you compile your model, you do not add the specific 'mae' metric.
If you want to add the 'mae' metric in your code, you need to compile the model in one of these two ways:
# Option 1
model.compile('sgd', metrics=[tf.keras.metrics.MeanAbsoluteError()])
# Option 2
model.compile('sgd', metrics=['mean_absolute_error'])
After this step, you can try to see if the correct name is val_mean_absolute_error or val_mae. Most likely, if you compile your model like I demonstrated in option 2, your code will work with "val_mean_absolute_error".
Also, you should include the code where you compile your model (i.e. the build_model() function); it is missing from the question text above.
I replaced 'val_mean_absolute_error' with 'val_mae' and it worked for me
FYI, I had the same problem that persisted even after changing the line history.history['val_mae'] as described in the answer.
In my case, in order for the val_mae dict object to be present in history.history object, I needed to ensure that the model.fit() code included the 'validation_data = (val_data, val_targets)' argument. I neglected to do this initially.
I fixed it with the line below:
mae_history = history.history["mae"]
The History object should contain the same metric names that you passed when compiling the model.
For example:
mean_absolute_error gives val_mean_absolute_error
mae gives val_mae
accuracy gives val_accuracy
acc gives val_acc
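Since the key depends on how the metric was named at compile time, one defensive option is to look the key up instead of hard-coding it. The snippet below is only a sketch: find_metric_key is a hypothetical helper (not part of Keras), and example_history is a hand-written stand-in for history.history.

```python
def find_metric_key(history_dict, candidates):
    """Return the first candidate key that exists in a Keras-style history dict."""
    for key in candidates:
        if key in history_dict:
            return key
    raise KeyError(
        f"none of {candidates} found; available keys: {sorted(history_dict)}"
    )

# Hypothetical stand-in for history.history (values are made up for illustration)
example_history = {
    "loss": [1.0, 0.5],
    "mae": [0.9, 0.6],
    "val_loss": [1.1, 0.7],
    "val_mae": [1.0, 0.8],
}

# Try the long name first, then fall back to the short one
key = find_metric_key(example_history, ["val_mean_absolute_error", "val_mae"])
mae_history = example_history[key]
print(key, mae_history)
```

If neither key exists, the error message lists the keys that are available, which makes a missing validation_data argument or a missing metric easier to spot.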

Using Keras like TensorFlow for gpu computing

I would like to know if Keras can be used as an interface to TensorFlow purely for doing computation on my GPU.
I have tested TF directly on my GPU. But for ML purposes, I started using Keras, including the backend. I would find it more comfortable to do everything in Keras instead of using two tools.
This is also a matter of curiosity.
I found some examples like this one:
http://christopher5106.github.io/deep/learning/2018/10/28/understand-batch-matrix-multiplication.html
However this example does not actually do the calculation.
It also does not get input data.
I duplicate the snippet here:
'''
from keras import backend as K
a = K.ones((3,4))
b = K.ones((4,5))
c = K.dot(a, b)
print(c.shape)
'''
I would simply like to know if I can get the result numbers from this snippet above, and how?
Thanks,
Michel
Keras doesn't have an eager mode like TensorFlow, and it depends on models or functions with "placeholders" to receive and output data.
So, it's a little more complicated than TensorFlow for basic calculations like this.
The most user-friendly solution would be to create a dummy model with one Lambda layer. (Be careful with the first dimension: Keras will insist on treating it as a batch dimension and will require that input and output have the same batch size.)
def your_function_here(inputs):
    # if you have more than one tensor for the inputs, it's a list:
    input1, input2, input3 = inputs
    # if you don't have a batch, you should probably have a first dimension = 1 and get
    input1 = input1[0]
    # do your calculations here
    # if you used the batch_size=1 workaround as above, add this dimension again:
    output = K.expand_dims(output, 0)
    return output
Create your model:
inputs = Input(input_shape)
#maybe inputs2 ....
outputs = Lambda(your_function_here)(list_of_inputs)
#maybe outputs2
model = Model(inputs, outputs)
And use it to predict the result:
print(model.predict(input_data))

Cross Validation in Sklearn using a Custom CV

I am dealing with a binary classification problem.
I have 2 lists of indexes listTrain and listTest, which are partitions of the training set (the actual test set will be used only later). I would like to use the samples associated with listTrain to estimate the parameters and the samples associated with listTest to evaluate the error in a cross validation process (hold out set approach).
However, I am not able to find the correct way to pass this to sklearn's GridSearchCV.
The documentation says that I should create "An iterable yielding (train, test) splits as arrays of indices". However, I do not know how to create this.
grid_search = GridSearchCV(estimator = model, param_grid = param_grid,cv = custom_cv, n_jobs = -1, verbose = 0,scoring=errorType)
So, my question is how to create custom_cv based on these indexes to be used in this method?
X is the feature matrix and y is the vector of labels.
Example: Suppose that I only have one hyperparameter alpha that belongs to the set {1,2,3}. I would like to set alpha=1, estimate the parameters of the model (for instance the coefficients of a regression) using the samples associated with listTrain, and evaluate the error using the samples associated with listTest. Then I repeat the process for alpha=2 and finally for alpha=3. Then I choose the alpha that minimizes the error.
EDIT: Actual answer to question. Try passing the cv argument a generator of the indices:
def index_gen(listTrain, listTest):
    yield listTrain, listTest

grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=index_gen(listTrain, listTest), n_jobs=-1,
                           verbose=0, scoring=errorType)
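A plain list containing one (train, test) tuple should also satisfy the "iterable yielding (train, test) splits" requirement, and unlike a generator object it can be iterated more than once. The sketch below uses toy data and a LogisticRegression with a small grid over C; these are assumptions made for illustration and stand in for the model, param_grid and errorType from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data (assumption): one feature, label is the sign of the feature
X = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)
y = (X[:, 0] > 0).astype(int)

# Stand-ins for the question's index lists: even rows train, odd rows validate
listTrain = np.arange(0, 40, 2)
listTest = np.arange(1, 40, 2)

# A single hold-out split expressed as a list with one (train, test) tuple
custom_cv = [(listTrain, listTest)]

grid_search = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=custom_cv,
    scoring="accuracy",
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Each candidate value of C is fitted on the listTrain rows and scored on the listTest rows, so there is exactly one split per candidate.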
EDIT: Before Edits:
As mentioned in the comment by desertnaut, what you are trying to do is bad ML practice, and you will end up with a biased estimate of the generalisation performance of the final model. Using the test set in the manner you're proposing will effectively leak test set information into the training stage, and give you an overestimate of the model's capability to classify unseen data. What I suggest in your case:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           n_jobs=-1, verbose=0, scoring=errorType)
grid_search.fit(x[listTrain], y[listTrain])
Now, your training set will be split into 5 folds (you can choose the number here). For a specific set of hyperparameters, the model is trained on 4 of those folds and tested on the fold that was left out. This is repeated 5 times, until every training example has been part of a left-out fold. The whole procedure is done for each hyperparameter setting you are testing (5x3 fits in this case).
grid_search.best_params_ will give you a dictionary of the parameters that performed the best over all 5 folds. These are the parameters that you use to train your final classifier, using again only the training set:
clf = LogisticRegression(**grid_search.best_params_).fit(x[listTrain],
y[listTrain])
Now, finally your classifier is tested on the test set and an unbiased estimate of the generalisation performance is given:
predictions = clf.predict(x[listTest])

Effective method for accumulating gradients in TensorFlow

It appears that there are already a couple questions on 'how to' accumulate gradients in TensorFlow. Here's the original and a duplicate.
The accepted recommendation, taken from this issue, is to do the following:
opt = tf.train.AdamOptimizer()
tvs = tf.trainable_variables()
accum_vars = [tf.Variable(tf.zeros_like(tv.initialized_value()), trainable=False) for tv in tvs]
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_vars]
gvs = opt.compute_gradients(rmse, tvs)
accum_ops = [accum_vars[i].assign_add(gv[0]) for i, gv in enumerate(gvs)]
train_step = opt.apply_gradients([(accum_vars[i], gv[1]) for i, gv in enumerate(gvs)])
In the training loop we have:
while True:
    sess.run(zero_ops)
    for i in range(n_minibatches):
        sess.run(accum_ops, feed_dict={X: Xs[i], y: ys[i]})
    sess.run(train_step)
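The idea behind the loop above can be checked without TensorFlow. The following is a minimal NumPy sketch, assuming a linear model with a summed squared-error loss and a plain gradient-descent update in place of Adam; mse_gradient, the data and the learning rate are all made up for illustration.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the summed loss 0.5 * ||X @ w - y||^2 with respect to w."""
    return X.T @ (X @ w - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.normal(size=12)
w = np.zeros(3)

n_minibatches = 4
accum = np.zeros_like(w)  # plays the role of accum_vars after zero_ops
for X_batch, y_batch in zip(np.split(X, n_minibatches), np.split(y, n_minibatches)):
    accum += mse_gradient(w, X_batch, y_batch)  # plays the role of accum_ops

# For a summed loss, the accumulated gradient equals the full-batch gradient
full_batch = mse_gradient(w, X, y)
print(np.allclose(accum, full_batch))

# One parameter update per accumulation cycle (role of train_step)
w_new = w - 0.01 * accum
```

With a mean loss instead of a summed one, the accumulated value would need to be divided by the number of minibatches to match the full-batch gradient.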
I managed to implement a minimal example of this in a Jupyter notebook but I'm bothered by the ad-hoc nature of the solution. Moreover, as shown in the notebook, when training is run a second time the accumulator poses a problem. It's not clear to me right now how I should address this problem.
So I found the solution to my problem and posted the solution in a public gist. The key thing is to reset the default graph when compiling a new graph and running training for a second time in the same notebook.
So we have:
tf.reset_default_graph()
model = mnist_network(seed=42)

How do I correctly manually recreate sklearn (python) logistic regression predict_proba outcome for multiple classification

If I run a basic logistic regression with 4 classes, I can get the predict_proba array.
How can I manually calculate the probabilities using the coefficients and intercepts? What are the exact steps to get the same answers that predict_proba generates?
There seem to be multiple questions about this online and several suggestions which are either incomplete or don't match up anyway.
For example, I can't replicate this process from my sklearn model so what is missing?
https://stats.idre.ucla.edu/stata/code/manually-generate-predicted-probabilities-from-a-multinomial-logistic-regression-in-stata/
Thanks,
Because I had the same question but could not find an answer that gave the same results I had a look at the sklearn GitHub repository to find the answer. Using the functions from their own package I was able to create the same results I got from predict_proba().
It appears that sklearn uses its own softmax() implementation, which is computed differently from the textbook softmax formula.
Let's assume you build a model like this:
from sklearn.linear_model import LogisticRegression
X = ...
Y = ...
model = LogisticRegression(multi_class="multinomial", solver="saga")
model.fit(X, Y)
Then you can calculate the probabilities either with model.predict_proba(X) or by using the sklearn function mentioned above to calculate them manually, like this:
from sklearn.utils.extmath import softmax
import numpy as np
scores = np.dot(X, model.coef_.T) + model.intercept_
softmax(scores) # Sklearn implementation
In the documentation for their own softmax() function, they note that
The softmax function is calculated by
np.exp(X) / np.sum(np.exp(X), axis=1)
This will cause overflow when large values are exponentiated. Hence
the largest value in each row is subtracted from each data point to
prevent this.
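The row-wise shift described in that note can be written in a few lines of NumPy. This is a sketch of the technique, not sklearn's actual source; stable_softmax and the example scores are assumptions for illustration.

```python
import numpy as np

def stable_softmax(scores):
    """Row-wise softmax that subtracts each row's maximum before exponentiating."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)  # largest entry becomes 0
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# The second row would overflow np.exp without the shift
scores = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1000.0, 1001.0, 1002.0, 1003.0],
])
probs = stable_softmax(scores)
print(probs)
```

Subtracting a constant from every entry of a row does not change the softmax result, so the two rows above yield the same probabilities even though the second row has much larger scores.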
Replicate sklearn calcs (saw this on a different post):
V = X_train.values.dot(model.coef_.transpose())
U = V + model.intercept_
A = np.exp(U)
P=A/(1+A)
P /= P.sum(axis=1).reshape((-1, 1))
This seems slightly different from the softmax calculation or the UCLA stats example, but it works.
