RandomForestClassifier crashes on training

RandomForestClassifier crashes on training - python-3.x

I am using scikit.learn RandomForestClassifier to generate a binary classifier, whenever I try fitting the model instance with training dataset, the kernel crashes after 3 or 4 seconds, I read a stackoverflow answer (Link: Jupyter Notebook and Colab Keep Crashing From Running Random Forest Model) to use pruning methods, but they don't seem to work.
The code for the Classifier is as follows -
# implementing RF
from sklearn.ensemble import RandomForestClassifier
# Instantiating rf model
rf_model = RandomForestClassifier(n_estimators = 10, random_state = 42, max_depth=10, max_leaf_nodes=10, max_features=None)
# Fitting the model
rf_model.fit(train_features, train_labels.ravel())
The shape of the training and testing dataset are as follows -
Training Features shape: (224553, 54)
Training Labels shape: (224553, 1)
Testing Features shape:(74852, 54)
Testing Labels shape: (74852, 1)
I have tried various methods but can't seem to fit the dataset, neither on my localmachine nor on Google collab, My machine stats are -
RAM 16gb
intel i7
Nvidia quadro 4gb graphics card
It would be great if you could help me with it. Thank you in advance.

Related

Memory Leakage tflite models

I have a tensorsorflow lite model (keras) in a flutter application that segments objects in an image (model is trained only on one specific object). With this model I can make 50-60 predictions without problems. Now i used the exact same model to train for another classification, when i use these two in a sequence it crashes after 3 images because of a memory leakage. For this task i have to have separate models, because in the future i might want to add/remove/switch just some of the models.
Does anyone have experience with keras.tenorsflow and memory leakage when tflite models are run sequentially?
I have done:
both models to tflite
include optimizers as false
compile as false
What i can not figure out:
Why one model can make 60 predictions in a row but not more than 3 when 2 models are run
How i can clear memory in flutter application
Where in model.predict() memory leakage occur
And i tried to minimize the size of the models through code below but that did not solve the issue.
import tensorflow as tf
model = tf.keras.models.load_model("model")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
converter.target_spec.supported_types = [tf.float16]
tflite_quant_model = converter.convert()
#save converted quantization model to tflite format
open("model.tflite", "wb").write(tflite_quant_model)

how to correctly shape input of a multiclass classification using keras stacked LSTM model

I am working on a multiple classification problem and after dabbling with multiple neural network architectures, I settled for a stacked LSTM structure as it yields the best accuracy for my use-case. Unfortunately the network takes a long time (almost 48 hours) to reach a good accuracy (~1000 epochs) even when I use GPU acceleration. The resulting accuracy and loss functions are:
At this point, giving the good performance but the very slow training I suspect a bug in my code. I tested it using the golden tests mentioned here, which consist of running tests with 2 points only either in the testing set or the training set along with eliminating the dropouts. Unfortunately, the outputs of these runs result in testing accuracy better than the training accuracy, which should not be the case as far as I know. I suspect that I am shaping my data in the wrong way. Any hints, suggestions and advises are appreciated.
My code is the following:
# -*- coding: utf-8 -*-
import keras
import numpy as np
from time import time
from utils import dmanip, vis
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical
from keras.callbacks import TensorBoard
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.client import device_lib
from sklearn.model_selection import train_test_split
###############################################################################
####################### Extract the data from .csv file #######################
###############################################################################
# get data
data, column_names = dmanip.get_data(file_path='../data_one_outcome.csv')
# split data
X = data.iloc[:, :-1]
y = data.iloc[:, -1:].astype('category')
###############################################################################
########################## init global config vars ############################
###############################################################################
# check if GPU is used
print(device_lib.list_local_devices())
# init
n_epochs = 1500
n_comps = X.shape[1]
###############################################################################
################################## Keras RNN ##################################
###############################################################################
# encode the classification labels
le = LabelEncoder()
yy = to_categorical(le.fit_transform(y))
# split the dataset
x_train, x_test, y_train, y_test = train_test_split(X, yy, test_size=0.35,
random_state=True,
shuffle=True)
# expand dimensions
x_train = np.expand_dims(x_train, axis=2)
x_test = np.expand_dims(x_test, axis=2)
# define model
model = Sequential()
model.add(LSTM(units=n_comps, return_sequences=True,
input_shape=(x_train.shape[1], 1),
dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(4 ,activation='softmax'))
# print model architecture summary
print(model.summary())
# compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Create a TensorBoard instance with the path to the logs directory
tensorboard = TensorBoard(log_dir='./logs/rnn/{}'.format(time()))
# fit the model
history = model.fit(x_train, y_train, epochs=n_epochs, batch_size=100,
validation_data=(x_test, y_test), callbacks=[tensorboard])
# plot results
vis.plot_nn_stats(history=history, stat_type="accuracy", fname="RNN-accuracy")
vis.plot_nn_stats(history=history, stat_type="loss", fname="RNN-loss")
My data is a large 2d matrix (38607, 150), where 149 is the number of features and 38607 is the number of samples, with a target vector including 4 classes.
feat1 feat2 ... feat148 feat149 target
1 2.250 0.926 ... 16.0 0.0 class1
2 2.791 1.235 ... 1.0 0.0 class2
. . . . . .
. . . . . .
. . . . . .
38406 2.873 1.262 ... 281.0 0.0 class3
38407 3.222 1.470 ... 467.0 1.0 class4

Regarding the Slowness of Training: You can think of using tf.data instead of Data Frames and Numpy Arrays because, Achieving peak performance requires an efficient input pipeline that delivers data for the next step before the current step has finished. The tf.data API helps to build flexible and efficient input pipelines.
For more information regarding tf.data, please refer this Tensorflow Documentation 1, Documentation 2.
This Tensorflow Tutorial guides you to convert your Data Frame to tf.data format.
One more important feature of use to you can be tf.profiler. Using Tensorflow Profiler, you can not only Visualize the Time and Memory Consumed in each phase of Data Science Project but it also provides us a Suggestion/Recommendation to reduce the Time/Memory Consumption and hence to Optimize our Project.
For more information on Tensorflow Profiler, refer this Documentation, this Tutorial and this Tensorflow DevSummit Youtube Video.
Regarding Testing Accuracy more than Training Accuracy: This is not a big problem and happens sometimes.
Probable Reason 1: Dropout ==> What is the reason for you to use Dropout and recurrent_dropout in your Model? Was the Model Overfitting? If the Model is not Overfitting, without Dropout and recurrent_dropout, then you can think of removing them because, If you set Dropout (0.2) and recurrent_dropout (0.2) it means 20% of features will be 0 and 20% of Time Steps will be 0, during training. However, during testing all the Features and Timesteps are used, so the model is more robust and have better testing accuracy.
Probable Reason 2: 35% of Testing Data is bit more than usual. You can make it either 20% or 25%.
Probable Reason 3: Your training data might have several arduous cases to learn and Your Testing data may contain easier cases to predict. To mitigate this, you can Split the Data Once again with different Random Seed.
For more information, please refer this Research Gate Link and this Stack Overflow Link.
Hope this helps. Happy Learning!

intel daal4py classifiers with scikit-learn

I am testing the sklearn-compatible wrappers for the latest version of the intel daal4py classifiers. The intel k-nearest classifier works fine with sklearn’s cross_val_score() and GridSearchCV. The performance boost from the intel classifier is significant and the intel and sklearn models provide generally comparable results across 10 different large public datasets and some simulated datasets.
The sklearn-compatible wrapper for the intel random forest classifier seems to be completely broken. The score() method does not work so I cannot proceed further with the intel random forest wrapper class.
I posted this at the intel AI Developer Forum, but I was wondering if anyone here has gotten the intel sklearn-compatible random forest classifier to work.
My next step is to test the native daal4py random forest object and possibly write my own wrapper because the native daal4py api is so different from sklearn. I was hoping to avoid this.
There seems to be some confusion on the intel site regarding the names of the wrapper classes.
I am using:
For k-nearest: daal4py.sklearn.neighbors.kdtree_knn_classifier (this
works fine)
For random forest:
daal4py.sklearn.ensemble.decision_forest.RandomForestClassifier
The failure in the intel RandomForestClassifier is in forest.py because n_classes_ is an int. n_classes_ matches the number of classes for the label variable that is passed. The label variable is an integer.
predictions = [np.zeros((n_samples, n_classes_[k]))
for k in range(self.n_outputs_)]

Please find below the steps we used to compute scores for daal4py RandomForestClassifier
(i) For cross_val_score
from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier
from sklearn.model_selection import cross_val_score
clf = RandomForestClassifier()
scores = cross_val_score(clf, train_data, train_labels, cv=3)
print(scores)
(ii)For GridSearchCV
from sklearn.model_selection import GridSearchCV
from daal4py.sklearn.ensemble.decision_forest import RandomForestClassifier
param_grid = {
'n_estimators': [200, 700],
'max_features': ['auto', 'sqrt', 'log2']
}
clf = RandomForestClassifier()
CV_rfc = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 5)
CV_rfc.fit(train_data, train_labels)
score=CV_rfc.score(train_data, train_labels)

Neural network regression with skewed data

I have been trying to build a machine learning model using Keras which predicts the radiation dose based on pre-treatment parameters. My dataset has approximately 2200 samples of which 20% goes into validation and testing.
The problem with the target variable is that it is very skewed since large radiation doses are much more rare than the small ones. Hence, I suspect that my regression model fails to predict the large values at all, and predicts everything around the mean, which is apparent from the figure. I have tried to log-normalise the target variable to make it more normally distributed, but it has had no effect.
Any suggestion how to fix this?
Target variable
Regression predictions

Computing individual sample weights based on 10 histogram bins helped in my case. See the code below:
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
hist, bin_edges = np.histogram(training_targets, bins = 10)
classes = training_targets.apply(lambda x: pd.cut(x, bin_edges, labels = False,
include_lowest = True)).values
sample_weights = compute_sample_weight('balanced', classes)

Use caffe to simulate the SGDclassifier or Logisticregression linear models in sklearn

I'm trying to use caffe to simulate the SGDclassifier and Logisticregression linear models in sklearn. As we all know, in caffe, one "InnerProduct" layer plus one "Softmaxwithloss" layer represent a logistic regression Y = Logit(WX+b).
I'm now using the digits dataset in the sklearn datasets package as the trianing set(5/6 of all the data-label pairs) and testing set(the rest 1/6). However, the accuracy obtained by SGDclassifer() or LogisticRegression() could reach nearly 90%, while the accuracy obtained by two-layer Neural Network cannot exceed 30% after training. Is this because of the parameter settings or something else? The gap between them is just kind of too large.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string