Doesn't scikit-learn need model initialization during looped training?

While implementing K-fold cross-validation with scikit-learn's DecisionTreeClassifier, I'm having a hard time understanding why this baseline code doesn't contain any model-initialization step. From my perspective, as fitting takes place across the iterations, the model that has already learned during the first iteration keeps its state (with identical parameters) during the second loop's fit, and so on.
You can see my code below.
What I'm really curious about is: unlike deep learning libraries such as PyTorch, is there no need for model initialization in scikit-learn, or does the code below do the initialization automatically? (If so, please let me know where the parameter initialization takes place.)
model = DecisionTreeClassifier()
cv_accuracy = []
n_iter = 0
kfold = KFold(n_splits=5, random_state=None, shuffle=False)

for train_index, validation_index in kfold.split(train_data, train_label):
    x_train, x_val = train_data[train_index], train_data[validation_index]
    y_train, y_val = train_label[train_index], train_label[validation_index]
    train_size = x_train.shape[0]
    val_size = x_val.shape[0]
    model.fit(x_train, y_train)
    pred = model.predict(x_val)
    n_iter += 1
    accuracy = np.round(accuracy_score(y_val, pred), 4)
    cv_accuracy.append(accuracy)
    # Thought I should initialize the model somehow... in this part
    model = DecisionTreeClassifier()

print('\n## Accuracy : ', np.mean(cv_accuracy))

fit() constructs a brand-new tree behind the scenes (stored on the estimator as tree_), so the re-initialization is done for you; the constructor just stores the hyperparameters that fit() will use.
Here's the relevant source code, by the way, so you can see for yourself.
A simplified version of what fit() does:
# Process data
self.tree_ = Tree(self.n_features_, self.n_classes_, self.n_outputs_)
builder = BestFirstTreeBuilder(splitter, min_samples_split, min_samples_leaf,
                               min_weight_leaf, max_depth, max_leaf_nodes,
                               self.min_impurity_decrease, min_impurity_split)
builder.build(self.tree_, X, y, sample_weight)
self._prune_tree()
return self
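In practice this means the manual re-initialization at the bottom of your loop is harmless but unnecessary, since each call to fit() discards the previous tree. A minimal sketch of the equivalent loop without it (assuming the same train_data / train_label arrays as in your code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier()
kfold = KFold(n_splits=5, shuffle=False)
cv_accuracy = []

for train_index, validation_index in kfold.split(train_data, train_label):
    x_train, x_val = train_data[train_index], train_data[validation_index]
    y_train, y_val = train_label[train_index], train_label[validation_index]
    model.fit(x_train, y_train)          # builds a fresh tree_ each time
    cv_accuracy.append(accuracy_score(y_val, model.predict(x_val)))

print('## Accuracy : ', np.mean(cv_accuracy))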

Related

How to compute loss for every data point in Keras?

I use TensorFlow 2.0 and Keras to train a model. I do the following to load a pre-trained model, which I then use for inference:
checkpoint_dir = "./"
x_test = np.random.normal(size=(n_points, n_features))
model = tf.keras.models.load_model(checkpoint_dir)
predictions = model.predict(x_test)
I would like to know whether I can get the loss for every data point as well. Is it possible to do something like
loss = model.compute_loss(x_test, y_test)
Just take a loss function from the backend and use it.
Example - if eager mode is on:
losses = tf.keras.backend.categorical_crossentropy(true_data, pred_data)
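Applied to the model and x_test from the question (TF 2.0 runs eagerly by default), this gives one loss value per sample. A minimal sketch, assuming y_test holds matching one-hot labels (not shown in the question):

import numpy as np
import tensorflow as tf

pred_data = model.predict(x_test)                          # shape: (n_points, n_classes)
y_true = tf.cast(y_test, pred_data.dtype)                  # one-hot labels, cast to match predictions
per_sample_loss = tf.keras.backend.categorical_crossentropy(y_true, pred_data)
print(np.asarray(per_sample_loss))                         # one loss value per data point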
Example - if eager mode is off:
def loss_calc(x):
    return backend.categorical_crossentropy(x[0], x[1])

trueIn = Input(shape_of_the_targets)
predIn = Input(shape_of_the_targets)
out = Lambda(loss_calc)([trueIn, predIn])
loss_model = Model([trueIn, predIn], out)
losses = loss_model.predict([true_data, pred_data])
You can evaluate the model using
model.evaluate(x_test, y_test)
evaluate() returns the loss value and metrics values for the model in test mode (https://keras.io/models/model/).

K-fold cross-validation in Python

What I'm trying to do:
Get the K-fold cross-validated scores of an SVM. The data has all-numerical independent variables and a categorical dependent variable. I'm using Python 3, sklearn, and feature-engine.
My understanding of the matter:
The independent variables have NA values; all of them are below 5% of the total data points, so I imputed them using the median values from the train set, as the variables are not normally distributed. I also scaled the values of the train and test sets using the values from the train set. My train-test split is 80-20.
I understand that it is good practice to scale and impute data using only the train set, as this helps avoid overfitting and data leakage.
When it comes to K-fold cross-validation, however, the train and test sets change with each fold.
Question:
Is there a way to ensure that I can re-impute and re-scale the train and test sets based on the train set of each fold?
Any help is appreciated, thank you!
Train-test split using a random seed; the same random seed is used in the K-fold cross-validation.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 3)
NA value imputation:
from feature_engine import missing_data_imputers as mdi
imputer = mdi.MeanMedianImputer(imputation_method = 'median')
imputer.fit(X_train)
X_train = imputer.transform(X_train)
Variable transformation:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train_trans = scaler.transform(X_train)
X_test_trans = scaler.transform(X_test)
Below is the SVM:
def svm1(gam, C):
    clf1 = svm.SVC(gamma=gam, C=C)
    clf1.fit(X_train_trans, y_train)
    print('The Trainset Score is {}.'.format(clf1.score(X_train_trans, y_train)))
    print('The Testset Score is {}.'.format(clf1.score(X_test_trans, y_test)))
    print('')
    y_pred1 = clf1.predict(X_test_trans)
    print('The confusion matrix is; \n{}'.format(metrics.confusion_matrix(y_test, y_pred1)))

interactive(svm1, gam=G1, C=cc1)
I then merge the train and test sets to get back a transformed dataset:
frames3 = [X_test_trans, X_train_trans ]
X_Final = pd.concat(frames3)
Now I fit on X_Final, which is the concatenated train and test set, to get the K-fold cross-validated score:
kfold = KFold(n_splits = 10, random_state = 3)
model = svm.SVC(gamma=0.23, C=3.20)
results = cross_val_score(model, PCA_X_Final,y_Final, cv = kfold)
print(results)
print('Accuracy = {}%, Standard Deviation = {}%'.format(round(results.mean(), 4), round(results.std(), 2)))
I would like to know how I can re-scale and re-impute each fold, so that the variables are re-scaled and the NA values re-imputed within each fold using only that fold's train set, to avoid overfitting / data leakage.
To impute and scale the data with parameters derived from each fold in the CV, you first need to put the engineering steps into a pipeline and then run the CV over the entire pipeline. For example, something like this:
Set up the engineering pipeline:
my_pipe = Pipeline([
    # missing data imputation
    ('imputer_num',
     mdi.MeanMedianImputer(imputation_method='mean', variables=['varA', 'varB'])),

    # scaler
    ('scaler', StandardScaler()),

    # Gradient Boosted machine (or your SVM instead)
    ('gbm', GradientBoostingClassifier(random_state=0))
])
Then the CV:
param_grid = {
    # try different gradient boosted tree model parameters
    'gbm__max_depth': [None, 1, 3],
}

# now we set up the grid search with cross-validation
grid_search = GridSearchCV(my_pipe, param_grid,
                           cv=5, n_jobs=-1, scoring='roc_auc')
More details in this notebook.
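If you just want the cross-validated scores rather than a grid search, the same idea works with cross_val_score. A minimal sketch, assuming X_train / y_train are the un-imputed, un-scaled pandas objects from your split, with your SVM swapped in for the gradient boosted machine (the feature-engine import path follows the version used in the question):

from sklearn import svm
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from feature_engine import missing_data_imputers as mdi

svm_pipe = Pipeline([
    ('imputer', mdi.MeanMedianImputer(imputation_method='median')),
    ('scaler', StandardScaler()),
    ('svm', svm.SVC(gamma=0.23, C=3.20)),
])

# Within every fold, the imputer and scaler are fitted on that fold's training
# portion only and then applied to the held-out portion before scoring.
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
results = cross_val_score(svm_pipe, X_train, y_train, cv=kfold)
print(results.mean(), results.std())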

CIFAR-10: implementing a model in Keras but getting different accuracy than the article

I'm not sure if Stack Overflow is the correct place to ask this, but here it goes:
I've read a paper called:
Lets Keep it simple, Using simple architectures to outperform deeper and more complex architectures (2016)
According to this paper, the architecture manages to achieve 94.75% accuracy, but so far my implementation reaches at most around 82% accuracy.
So my questions are:
Regarding convolution layer stacking: is it OK to stack conv -> BatchNorm -> ReLU -> (optional) max pooling, or is the order different?
What am I doing wrong here?
Note that I've tried various dropout values, and no dropout at all (the reason being that I can get 100% accuracy on the training set but only 80% on the test set, evaluated on half of the test set). I've also tried fewer dense layers and played with the learning rate (increasing and, as you can see, decreasing).
Any suggestions would be greatly appreciated!
Edit:
I've managed to improve it slightly with kernel weight initialization (on each conv layer):
initializers.VarianceScaling(scale=1.0, mode='fan_in', distribution='normal')
But it still reaches at most 86% accuracy on the test set, and introducing dropout does not increase test-set accuracy (from my understanding, dropout should help the model generalize better). Either way, it's still not the accuracy the article claims to achieve.
Any help would be really appreciated!
My code:
def create_conv_block(X, filters=64, kernel=[3, 3], strides=[1, 1],
                      repetition=1, withMaxPooling=True,
                      pool_kernel=[2, 2], pool_strides=[2, 2],
                      relualpha=0, withDropOut=True, dropout_precent=0.5):
    conv_layer = X
    while repetition > 0:
        conv_layer = layers.Conv2D(filters=filters,
                                   kernel_size=kernel,
                                   strides=strides, padding='same')(conv_layer)
        conv_layer = layers.BatchNormalization()(conv_layer)
        conv_layer = layers.LeakyReLU(alpha=relualpha)(conv_layer)
        if withMaxPooling:
            try:
                conv_layer = layers.MaxPooling2D(pool_size=pool_kernel,
                                                 strides=pool_strides)(conv_layer)
            except:
                conv_layer = layers.MaxPooling2D(pool_size=pool_kernel,
                                                 strides=pool_strides,
                                                 padding='same')(conv_layer)
        if withDropOut:
            conv_layer = layers.Dropout(rate=dropout_precent)(conv_layer)
        repetition -= 1
    return conv_layer
def train(model_name):
    # https://arxiv.org/pdf/1608.06037.pdf
    global inputs, res
    batch_size = 100
    input_shape = (32, 32, 3)
    inputs = layers.Input(shape=input_shape)
    block1 = create_conv_block(inputs, withMaxPooling=False, withDropOut=True)
    block2 = create_conv_block(block1, filters=128, repetition=3, withDropOut=True)
    block3 = create_conv_block(block2, filters=128, repetition=2, withMaxPooling=False)
    block4 = create_conv_block(block3, filters=128, withDropOut=False)
    block5 = create_conv_block(block4, filters=128, repetition=2, withDropOut=True)
    block6 = create_conv_block(block5, filters=128, withMaxPooling=False, withDropOut=True)
    block7 = create_conv_block(block6, filters=128, withMaxPooling=False, kernel=[1, 1], withDropOut=True)
    block8 = create_conv_block(block7, filters=128, kernel=[1, 1], withDropOut=False)
    block9 = create_conv_block(block8, filters=128, withDropOut=True)
    block9 = create_conv_block(block9, filters=128, withDropOut=False)
    flatty = layers.Flatten()(block9)
    dense1 = layers.Dense(128, activation=activations.relu)(flatty)
    dense1 = layers.Dropout(0.5)(dense1)
    dense1 = layers.Dense(512, activation=activations.relu)(dense1)
    dense1 = layers.Dropout(0.2)(dense1)
    dense2 = layers.Dense(512, activation=activations.relu)(dense1)
    dense1 = layers.Dropout(0.5)(dense2)
    dense2 = layers.Dense(512, activation=activations.relu)(dense1)
    dense3 = layers.Dropout(0.2)(dense2)
    res = layers.Dense(10, activation='softmax')(dense3)
    model = models.Model(inputs=inputs, outputs=res)
    opt = optimizers.Adam(lr=0.001)
    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['accuracy'])
    model.summary()
    reduce_lr = keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=5, min_lr=1e-10)
    keras.utils.plot_model(model, to_file=model_name + '.png', show_shapes=True, show_layer_names=True)
    model.fit(x=train_X, y=train_y, batch_size=batch_size, epochs=100,
              validation_data=(test_X[:len(test_X) // 2], test_y[:len(test_X) // 2]),
              callbacks=[reduce_lr])
    model.save(model_name + '.h5')
    return model

name = 'kis_convo_drop'
model = train(name)
It has an official implementation on GitHub, which you can check: SimpleNet is the original official Caffe implementation, and SimpleNet in Pytorch is the official PyTorch implementation.
Apart from that, I noticed you are implementing a different architecture; your implementation is not the same as the one you are trying to reproduce.
You are using Dense layers, whereas in SimpleNet there are only convolutional layers, and the only dense layer is the one used for classification.
You are using LeakyReLU instead of ReLU.
You are using the Adam optimizer, whereas they used Adadelta in their implementation.
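As a rough sketch of what those three changes look like in Keras (the layer counts and widths here are placeholders, not the paper's exact SimpleNet configuration):

from tensorflow.keras import layers, models, optimizers, losses

def simple_block(x, filters, pool=False):
    # conv -> BatchNorm -> plain ReLU (instead of LeakyReLU)
    x = layers.Conv2D(filters, (3, 3), padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    if pool:
        x = layers.MaxPooling2D((2, 2))(x)
    return x

inputs = layers.Input(shape=(32, 32, 3))
x = simple_block(inputs, 64)
x = simple_block(x, 128, pool=True)
x = simple_block(x, 128, pool=True)

# no intermediate Dense stack: the only dense layer is the classifier
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation='softmax')(x)

model = models.Model(inputs, outputs)
model.compile(optimizer=optimizers.Adadelta(),   # Adadelta instead of Adam
              loss=losses.categorical_crossentropy,
              metrics=['accuracy'])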

keras fit_generator reading chunks from hdfstore

I'm trying to build a generator for a Keras model which will be trained on a large HDF store.
To speed up the training, I have already pre-calculated all features, including the one-hot encoding, in the HDF store, so reading from it should be straightforward.
To feed chunks of my data into the network, I'm trying to use fit_generator, but I'm struggling to get it up and running.
The generator:
def myGenerator(myStore, generateFrom, generateTo):
    # Create empty arrays to contain a batch of features and labels
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        yield X, y
Network and fitting:
def get_model(shape):
    '''Create a keras model.'''
    inputlayer = Input(shape=shape)
    model = BatchNormalization()(inputlayer)
    model = Dense(1024, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(512, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(256, activation='relu')(model)
    model = Dropout(0.25)(model)
    model = BatchNormalization()(inputlayer)
    model = Dense(128, activation='relu')(model)
    model = Dropout(0.25)(model)
    # 11 because background noise has been taken out
    model = Dense(2, activation='tanh')(model)
    model = Model(inputs=inputlayer, outputs=model)
    return model

shape = (6603, 10000)
model = get_model(shape)
model.compile(loss='mean_squared_error', optimizer=Adam(), metrics=['accuracy'])

# X = generator(myStore)
# Xt = generator(myStore)

labelbinarizer = LabelBinarizer()
y = labelbinarizer.fit_transform(y)
# yt = labelbinarizer.fit_transform(yt)

generateFrom = 0
for i in range(10):
    generateTo = generateFrom + 10000
    model.fit_generator(
        generator=myGenerator(myStore, generateFrom, generateTo),
        epochs=1,
        steps_per_epoch=X[0].shape[0] // 1000)
    generateFrom = generateTo
I have tried both approaches: keeping fit_generator inside a loop and passing in the range (as shown above), and handling the range inside the generator. Neither works. I'm currently running into:
TypeError: 'generator' object is not subscriptable
I likely have some misunderstanding of how fit_generator() is supposed to be used in this context; most examples out there are about generating tensors from pictures.
Any hint is appreciated.
Thanks
The function read_hdf returns a pandas object; you need to convert it to a NumPy array.
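For example, a minimal sketch of the generator from the question with that conversion applied (same myStore, generateFrom and generateTo as above):

def myGenerator(myStore, generateFrom, generateTo):
    while True:
        X = pd.read_hdf(myStore, 'X', start=generateFrom, stop=generateTo)
        y = pd.read_hdf(myStore, 'y', start=generateFrom, stop=generateTo)
        # .values turns the pandas DataFrame/Series into plain NumPy arrays
        yield X.values, y.values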

sklearn GridSearchCV: how to get classification report?

I am using GridSearchCV like this:
corpus = load_files('corpus')
with open('stopwords.txt', 'r') as f:
    stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

x = corpus.data
y = corpus.target

pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])

parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)
Then, in another file, to classify new documents (not from the corpus), I do this:
classifier = joblib.load(filepath) # path to .pkl file
result = classifier.predict(tokenlist)
My question is: where do I get the values needed for the classification_report?
In many other examples, I see people split the corpus into a training set and a test set.
However, since I am using GridSearchCV with k-fold cross-validation, I don't need to do that.
So how can I get those values from GridSearchCV?
If you have the GridSearchCV object:
from sklearn.metrics import classification_report

clf = GridSearchCV(....)
clf.fit(x_train, y_train)
classification_report(y_test, clf.best_estimator_.predict(x_test))

If you have saved the best estimator and loaded it, then:
classifier = joblib.load(filepath)
classification_report(y_test, classifier.predict(x_test))
The best model is in clf.best_estimator_. You need to fit it on the training data, then predict your test data and pass y_test and the predictions to the classification report.
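A minimal end-to-end sketch of that workflow, assuming the pipeline and parameters defined in the question (the held-out split is what feeds classification_report):

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# hold out a test set; GridSearchCV does its k-fold CV inside the training split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf.fit(x_train, y_train)

y_pred = gs_clf.best_estimator_.predict(x_test)
print(classification_report(y_test, y_pred))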

Resources