Reinstantiating model at each step of cross validation using KFold - scikit-learn

I am a little confused about a simple cross-validation loop using KFold:
for train_idx, val_idx in kf.split(X_train_scaled, y_train):
    model.fit(X_train_scaled.iloc[train_idx], y_train.iloc[train_idx])
    preds = model.predict(X_train_scaled.iloc[val_idx])
I have the feeling that at each iteration I should reinstantiate the model, e.g. by calling model = RandomForestRegressor(), so that it trains from scratch... or can I leave it like this?
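For reference, scikit-learn estimators discard any previously learned state whenever fit() is called again, so the loop above already trains from scratch on every fold. If you prefer to make that explicit, a minimal sketch using sklearn.base.clone (with hypothetical stand-in data, since the question's X_train_scaled and y_train are not shown) could look like this:

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# hypothetical stand-ins for the question's data
X_train_scaled = pd.DataFrame(np.random.rand(100, 4), columns=list('abcd'))
y_train = pd.Series(np.random.rand(100))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
base_model = RandomForestRegressor(random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_train_scaled, y_train):
    # clone() returns a fresh, unfitted estimator with the same hyperparameters
    model = clone(base_model)
    model.fit(X_train_scaled.iloc[train_idx], y_train.iloc[train_idx])
    preds = model.predict(X_train_scaled.iloc[val_idx])
    fold_scores.append(mean_squared_error(y_train.iloc[val_idx], preds))

print(np.mean(fold_scores))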

Related

Doesn't scikit-learn need model initialization during looped training?

While implementing K-fold cross-validation with scikit-learn's DecisionTreeClassifier, I'm having a hard time understanding why this baseline code doesn't contain any model initialization step. From my perspective, as fitting takes place across the iterations, the model that has already learned during the first iteration keeps the same parameters during the second loop's fit, and so on.
You can see my code below.
What I'm really curious about is: unlike deep learning libraries such as PyTorch, is there no need for model initialization in scikit-learn? Or does the code below automatically do the initialization? (If so, please let me know where the parameter initialization takes place.)
model = DecisionTreeClassifier()
cv_accuracy = []
n_iter = 0

kfold = KFold(n_splits=5, random_state=None, shuffle=False)
for train_index, validation_index in kfold.split(train_data, train_label):
    x_train, x_val = train_data[train_index], train_data[validation_index]
    y_train, y_val = train_label[train_index], train_label[validation_index]
    train_size = x_train.shape[0]
    val_size = x_val.shape[0]

    model.fit(x_train, y_train)
    pred = model.predict(x_val)

    n_iter += 1
    accuracy = np.round(accuracy_score(y_val, pred), 4)
    cv_accuracy.append(accuracy)

    # Thought I should initialize model somehow... in this part
    model = DecisionTreeClassifier()

print('\n## Accuracy : ', np.mean(cv_accuracy))
fit() constructs a brand-new tree behind the scenes (stored in the estimator's tree_ attribute), so it does the re-initialization for you. The class's __init__ just stores the hyperparameters that fit() will use.
You can check the scikit-learn source code for fit() to see this for yourself.
Simplified version of what fit() does:
# Process data
self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)
builder = BestFirstTreeBuilder(splitter, min_samples_split, min_samples_leaf,
                               min_weight_leaf, max_depth, max_leaf_nodes,
                               self.min_impurity_decrease, min_impurity_split)
builder.build(self.tree_, X, y, sample_weight)
self._prune_tree()
return self
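As a quick sketch (not part of the original answer), you can convince yourself that each call to fit() discards the previous tree by checking that the tree_ attribute is replaced with a new object after refitting, here on a toy dataset from sklearn.datasets:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

model.fit(X[:100], y[:100])
first_tree = model.tree_          # tree built by the first fit

model.fit(X[50:], y[50:])         # refit on different data
print(model.tree_ is first_tree)  # False: fit() built a brand-new tree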

How to compute loss for every data point in Keras?

I use TensorFlow 2.0 and Keras to train a model. I do the following to load a pre-trained model, which I then use for inference:
checkpoint_dir = "./"
x_test = np.random.normal(size=(n_points, n_features))  # random test inputs
model = tf.keras.models.load_model(checkpoint_dir)
predictions = model.predict(x_test)
I would like to know whether I can also get the loss for every individual data point. Is it possible to do something like
loss = model.compute_loss(x_test, y_test)
Just take a loss function from the backend and use it.
Example - if eager mode is on:
losses = tf.keras.backend.categorical_crossentropy(true_data, pred_data)
Example - if eager mode is off:
from tensorflow.keras import backend
from tensorflow.keras.layers import Input, Lambda
from tensorflow.keras.models import Model

def loss_calc(x):
    return backend.categorical_crossentropy(x[0], x[1])

trueIn = Input(shape_of_the_targets)
predIn = Input(shape_of_the_targets)
out = Lambda(loss_calc)([trueIn, predIn])
loss_model = Model([trueIn, predIn], out)

losses = loss_model.predict([true_data, pred_data])
You can evaluate the model using
model.evaluate(x_test, y_test)
evaluate() returns the loss value and metric values for the model in test mode (https://keras.io/models/model/).
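As an additional sketch (not from the original answers): in TF 2.x you can also instantiate a Keras loss object with its reduction turned off, which yields one loss value per sample. Assuming a classification setup with one-hot targets:

import numpy as np
import tensorflow as tf

# hypothetical stand-ins for true labels and model predictions
y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
y_pred = np.array([[0.1, 0.9], [0.4, 0.6]])

# Reduction.NONE keeps one loss value per data point instead of averaging
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)
per_sample_losses = loss_fn(y_true, y_pred).numpy()
print(per_sample_losses)  # shape (2,): one loss per example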

Implementing Tensorflow Regression Model on Basketball data

I am following along the following guide to tensorflow regression models: https://www.tensorflow.org/tutorials/keras/basic_regression
I am using basketball data; I want to predict NBA career length based on college stats, and I currently have the data normalized.
I then build the following model based on the code in the above link:
def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu,
                           input_shape=(train.shape[1],)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae'])
    return model

model = build_model()
model.summary()
This appears to work fine. However, when I then try to fit the model and record the training history using the following code:
EPOCHS = 200
labels = ['Age','G','FG','FGA','X3P','X3PA','FTA','TRB','AST','STL','BLK','Wt','final_ht','colyears','nbayears']
# Store training stats
history = model.fit(train, labels, epochs=EPOCHS, validation_split=0.2, verbose=0)
This gives me the error 'str' object has no attribute 'ndim', and I'm having trouble understanding what it means. Am I doing something wrong?
When you call the model's .fit function, the second argument should be your target variable (NBA career length). That should be a one-dimensional array of values rather than the list of column-name strings you passed to the function.
This should solve the problem.
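A minimal sketch of that fix, assuming the normalized data sits in a pandas DataFrame called train and that the target column is 'nbayears' (the column names are taken from the question's labels list):

# separate the (assumed) target column from the feature columns
y = train['nbayears'].values          # 1-D array of career lengths
X = train.drop(columns=['nbayears'])  # remaining columns are the features

model = build_model()                 # build_model() as defined above, with
                                      # input_shape=(X.shape[1],) for the features
history = model.fit(X.values, y, epochs=EPOCHS, validation_split=0.2, verbose=0)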

Cross-validation in sklearn: do I need to call fit() as well as cross_val_score()?

I would like to use k-fold cross validation while learning a model. So far I am doing it like this:
# splitting dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25, random_state=4222)
# learning a model
model = MultinomialNB()
model.fit(X_train, y_train)
scores = cross_val_score(model, X_train, y_train, cv=5)
At this step I am not quite sure whether I should call model.fit() or not, because in the official sklearn documentation they do not fit but just call cross_val_score as follows (they do not even split the data into training and test sets):
from sklearn.model_selection import cross_val_score
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
I would also like to tune the hyperparameters of the model while learning it. What is the right pipeline?
If you want to do hyperparameter selection then look into RandomizedSearchCV or GridSearchCV. If you want to use the best model afterwards, then call either of these with refit=True and then use best_estimator_.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

log_params = {'penalty': ['l1', 'l2'], 'C': [1E-7, 1E-6, 1E-5, 1E-4, 1E-3]}
clf = LogisticRegression()
search = RandomizedSearchCV(clf, scoring='average_precision', cv=10,
                            n_iter=10, param_distributions=log_params,
                            refit=True, n_jobs=-1)
search.fit(X_train, y_train)
clf = search.best_estimator_
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
Your second example is right for doing the cross validation. See the example here: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
The fitting is done inside the cross_val_score function; you don't need to worry about it beforehand.
[Edited] If, besides cross validation, you want to train a model, you can call model.fit() afterwards.
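Putting the two answers together, a minimal sketch of the pipeline using the question's own MultinomialNB and variable names: cross-validate on the training split, then fit once on the full training split and evaluate on the held-out test set.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()

# 1. estimate generalization performance with 5-fold CV on the training split;
#    cross_val_score clones and fits the model internally for each fold
scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean(), scores.std())

# 2. fit the final model on the whole training split and check it on the test set
model.fit(X_train, y_train)
print(model.score(X_test, y_test))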

Keras Model Accuracy differs after loading the same saved model

I trained a Keras Sequential model and loaded it again later. The two models give different accuracies.
I came across a similar question but was not able to solve the problem.
Sample code:
Loading and training the model
model = gensim.models.FastText.load('abc.simple')
X, y = load_data()
Vectors = np.array(vectors(X))
X_train, X_test, y_train, y_test = train_test_split(Vectors, np.array(y),
                                                    test_size=0.3, random_state=0)
X_train = X_train.reshape(X_train.shape[0], 100, max_tokens, 1)
X_test = X_test.reshape(X_test.shape[0], 100, max_tokens, 1)  # data for input to our model

print(X_train.shape)
model2 = train()
score = model2.evaluate(X_test, y_test, verbose=0)
print(score)
The training accuracy is 90%.
Saving the model:
# Saving Model
model_json = model2.to_json()
with open("model_architecture.json", "w") as json_file:
json_file.write(model_json)
model2.save_weights("model_weights.h5")
print("Saved model to disk")
But after I restarted the kernel, loaded the saved model, and ran it on the same set of data, the accuracy dropped.
# load json and create model
json_file = open('model_architecture.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# load weights into new model
loaded_model.load_weights("model_weights.h5")
print("Loaded model from disk")

# evaluate loaded model on test data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
                     metrics=['accuracy'])
score = loaded_model.evaluate(X_test, y_test, verbose=0)
print(score)
The accuracy dropped to 75% on the same set of data.
How can I make it consistent?
I have tried the following, but it did not help:
from keras.backend import manual_variable_initialization
manual_variable_initialization(True)
I even saved the whole model at once (weights and architecture), but that did not solve the issue either.
Not sure if your issue has been solved, but for future readers:
I had exactly the same problem with saving and loading the weights: on loading the model, the accuracy and loss changed greatly, from 68% accuracy down to 2%. In my experiment I am using TensorFlow as the backend with the Keras layers Embedding, LSTM and Dense. My issue was solved by fixing the seed for Keras, which uses the NumPy random generator, and, since I am using TensorFlow as the backend, I also fixed its seed.
These are the lines I added at the top of my file where the model is also defined.
from numpy.random import seed
seed(42)  # keras seed fixing
import tensorflow as tf
tf.random.set_seed(42)  # tensorflow seed fixing
I hope this helps.
For more information have a look at this- https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
I had the same problem due to a silly mistake of mine: after loading the model, the shuffle option in my data generator (useful for training) was set to True instead of False. After changing it to False, the model predicted as expected. It would be nice if Keras could take care of this automatically. This is the critical part of my code:
pred_generator = pred_datagen.flow_from_directory(
    directory='./ims_dir',
    target_size=(100, 100),
    color_mode="rgb",
    batch_size=1,
    class_mode="categorical",
    shuffle=False,
)

model = load_model(logpath_ms)
pred = model.predict_generator(pred_generator, steps=N, verbose=1)
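As a follow-up sketch (my addition, not part of the original answer): with shuffle=False the generator yields samples in a fixed order, so the predictions can be lined up against the generator's class labels (and filenames) when computing accuracy:

import numpy as np

# with shuffle=False, row i of pred corresponds to pred_generator.filenames[i]
predicted_classes = np.argmax(pred, axis=1)
true_classes = pred_generator.classes   # ground-truth labels in the same order
print(np.mean(predicted_classes == true_classes))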
My code worked once I scaled my dataset before re-evaluating the model. I had applied this preprocessing before saving the model and had forgotten to repeat it when I loaded the model and wanted to evaluate it again. After I did that, the accuracy came out as it should \o/
model_saved = keras.models.load_model('tuned_cnn_1D_HAR_example.h5')
trainX, trainy, testX, testy = load_dataset()
trainX, testX = scale_data(trainX, testX, True)
score = model_saved.evaluate(testX, testy, verbose=0)
print("%s: %.2f%%" % (model_saved.metrics_names[1], score[1]*100))
Inside my scale_data function I used StandardScaler().
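Beyond the fixes above, a common way to keep evaluation consistent is to save the whole model in a single file (architecture, weights, and optimizer state together) and apply exactly the same preprocessing at load time. A minimal sketch, with the file name chosen for illustration:

import tensorflow as tf

# after training: save architecture + weights + optimizer state in one file
model2.save('full_model.h5')

# later, e.g. in a fresh kernel: load and evaluate with the SAME preprocessing
restored = tf.keras.models.load_model('full_model.h5')
score = restored.evaluate(X_test, y_test, verbose=0)
print(score)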

Resources