Error when making predictions on un-pickled classifier - python-3.x

I am writing a text classification program that takes over a thousand emails as input. For convenience, I save the classifier to a pickle file once training is complete, so that on later runs of the program I won't have to retrain it.
import os
import pickle
from sklearn.naive_bayes import GaussianNB
path = 'classifier.pkl'
clf = GaussianNB()
if not os.path.exists(path):
    # no saved classifier yet: train one and save it
    clf.fit(x_train, y_train)
    with open(path, 'wb') as f:
        pickle.dump(clf, f)
else:
    print('<classifier found!>')
    with open(path, 'rb') as input_file:
        clf = pickle.load(input_file)
pred = clf.predict(x_test)  # the error occurs on this line
The prediction works on the first run (when the classifier is not loaded from a file), but on subsequent runs it gives me this error:
ValueError: operands could not be broadcast together with shapes (3516,379) (376,)
The shapes of x_train and x_test are (14062, 379) and (3516, 379) respectively.
Any help would be appreciated.
Edit: I have tried desertnaut's suggestion of pickling pred = clf.predict(x_test) and reusing it in later runs of the program, and the accuracy score I get from those runs seems to be about half the score obtained when initially training the classifier.

I could not figure out why pickling does not work. However, joblib (which older scikit-learn versions bundled as sklearn.externals.joblib) seems to work just fine.
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
if not os.path.exists(path):
    clf = clf.fit(x_train, y_train)
    joblib.dump(clf, path)
else:
    clf = joblib.load(path)
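The shapes in the error (379 columns in x_test against an array of length 376) suggest that the classifier on disk may have been fitted on a feature matrix with a different number of columns, for instance a stale file left over from an earlier run. The following is only a sketch under that assumption: `load_or_train` is a hypothetical helper, the data is synthetic, and the check relies on `GaussianNB.theta_` having one column per feature.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB


def load_or_train(path, x_train, y_train):
    """Hypothetical helper: reuse the saved classifier only if its feature count still matches."""
    if os.path.exists(path):
        clf = joblib.load(path)
        # theta_ holds the per-class feature means, shape (n_classes, n_features)
        if clf.theta_.shape[1] == x_train.shape[1]:
            return clf, "loaded"
    # No file, or a stale one: train again and overwrite it
    clf = GaussianNB().fit(x_train, y_train)
    joblib.dump(clf, path)
    return clf, "trained"


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=40)
x_old = rng.normal(size=(40, 376))  # stands in for an earlier feature matrix
x_new = rng.normal(size=(40, 379))  # stands in for the current feature matrix

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
_, first = load_or_train(path, x_old, y)
_, second = load_or_train(path, x_new, y)   # stale file is detected and replaced
clf, third = load_or_train(path, x_new, y)  # matching file is reused
statuses = (first, second, third)
pred = clf.predict(x_new)
```

If the feature matrix comes from a vectorizer, the same mismatch can appear whenever the vectorizer is refitted between runs, so deleting the old file or checking the column count before predicting is a cheap safeguard.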

Related

Tensorflow model prediction fails when run right after model training

I'm having trouble with my model prediction. Training works fine, but afterwards my program fails when predicting with the trained model. When I rerun my code, training is skipped because it is already done, and the prediction then works as it is supposed to. Searching online, I find this error only in connection with model training, so I guess those solutions don't apply to me. I think the reason for the error is that my video RAM is not entirely freed after model training. That's why I tried the following, without success.
tf.keras.backend.clear_session()
tf.compat.v1.reset_default_graph()
K.clear_session()
Error code:
prediction = model.predict(x)[:, 0]#.flatten() # flatten was needed now
File "/home/max/PycharmProjects/Masterthesis/venv3-8-12/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/max/PycharmProjects/Masterthesis/venv3-8-12/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Do you have any ideas on how to solve the problem?
My Setup:
Python: 3.8.12
Tensorflow-gpu: 2.7.0
System: Manjaro Linux
Cuda: 11.5
GPU: NVIDIA GeForce GTX 980 Ti
My Code:
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout
import tensorflow as tf
import h5py
import keras.backend as K
def loss_function(y_true, y_pred):
    alpha = K.std(y_pred) / K.std(y_true)
    beta = K.sum(y_pred) / K.sum(y_true)
    error = K.sqrt( + K.square(1 - alpha) + K.square(1 - beta))
    return error
i = Input(shape=(171, 11))
x = LSTM(100, return_sequences=True)(i)
x = LSTM(50)(x)
x = Dropout(0.1)(x)
out = Dense(1)(x)
model = Model(i, out)
model.compile(
    loss=loss_function,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
with h5py.File("db.hdf5", 'r') as db_:
    r = model.fit(
        db_["X_train"][...],
        db_["Y_train"][...],
        epochs=1,
        batch_size=64,
        verbose=1,
        shuffle=True)
model.save("model.h5")
model = load_model("model.h5", compile=False)
with h5py.File("db.hdf5", 'r') as db:
    x = db["X_val"][...]
    y = db["Y_val"][...].flatten()
prediction = model.predict(x)[:, 0].flatten()
I found the solution to my problem. Since I'm using a custom loss function, I somehow needed to specify the custom loss function when loading the model again.
I accomplished this by modifying this line
model = load_model("model.h5", compile=False)
to this one
model = load_model("model.h5", custom_objects={"loss_function": loss_function})

TimeSeries NLP: Using ARIMA with CountVectorizer

I'm practicing on the Kaggle news headline dataset, which is paired with DJIA prices exported from Yahoo Finance: https://www.kaggle.com/aaron7sun/stocknews#Combined_News_DJIA.csv
There are not many discussions on NLP with time series. I tried adapting this article's code using CountVectorizer(), but without success. Does anyone have any resources or suggestions?
My code below is based on the headlines in the dataset above:
def modeller(vect, X_tr, y_tr, X_te):
    X_train_dtm = vect.fit_transform(X_tr.unstack())
    X_test_dtm = vect.fit_transform(X_te.unstack())
    X_tr_arima = [x for x in X_train_dtm]
    print('done with count vectorizer. now modelling.')
    model = ARIMA(X_tr_arima, order=(1,1,1))
    print('done modelling. now fitting')
    model_fit = model.fit(X_tr_arima, y_tr)
    y_hat = model.predict(x_te_arima)
    return y_hat
vect = CountVectorizer(stop_words='english')
X_train, X_test, y_train, y_test = X.iloc[0:100], X.iloc[100:X.shape[0]], y[0:100], y[100:len(y)]
modeller(vect, X_train, y_train, X_test)
Output (error from ARIMA line):
ValueError: setting an array element with a sequence.
I had the same problem and I could fix it using this approach.
Try to change
from pmdarima.pipeline import Pipeline
to
from pmdarima.pipeline import Pipeline as arimaPip

SVM classification - Bad input shape Error

I'm getting a "bad input shape" error. I tried searching, but I can't understand it yet since I'm new to SVMs.
train.csv
testing.csv
# importing required libraries
import numpy as np
import pandas as pd
# import support vector classifier
from sklearn.svm import SVC
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
X = pd.read_csv("train.csv")
y = pd.read_csv("testing.csv")
clf = SVC()
clf.fit(X, y)
clf.decision_function(X)
print(clf.predict(X))
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (1, 6)
The problem here is that you pass the entire training table (data points plus labels) as X and the entire testing table (again data points plus labels) as y, so the SVM receives a whole table where it expects one label per sample.
It does not work that way.
What you need to do is train the SVM on your training data (data points plus the label for each data point) and then test it against your testing data (testing data points plus labels).
Your code should look like this instead:
# Load training and testing dataset from .csv files
training_dataset = pd.read_csv("train.csv")
testing_dataset = pd.read_csv("testing.csv")
# Load training data points with all relevant features
X_train = training_dataset[['feature1','feature2','feature3','feature4']]
# Load training labels from dataset
y_train = training_dataset['label']
# Load testing data points with all relevant features
X_test = testing_dataset[['feature1','feature2','feature3','feature4']]
# Load testing labels from dataset
y_test = testing_dataset['label']
clf = SVC()
# Train the SVC with the training data (data points and labels)
clf.fit(X_train, y_train)
# Evaluate the decision function with test samples
clf.decision_function(X_test)
# Predict the test samples
print(clf.predict(X_test))
I hope that helps and that this code runs for you. Let me know if I misunderstood something or you have more questions. :)

How to save classifier in sklearn with Countvectorizer() and TfidfTransformer()

Below is some code for a classifier. I used pickle to save and load the classifier, as instructed on this page. However, when I load it to use it, I cannot use the CountVectorizer() and TfidfTransformer() to convert raw text into vectors that the classifier can use.
The only way I was able to get it to work was to analyze the text immediately after training the classifier, as seen below.
import os
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas
import pickle
class Classifier:
    def __init__(self):
        self.moviedir = os.getcwd() + '/txt_sentoken'
    def Training(self):
        # loading all files.
        self.movie = load_files(self.moviedir, shuffle=True)
        # Split data into training and test sets
        docs_train, docs_test, y_train, y_test = train_test_split(self.movie.data, self.movie.target,
                                                                  test_size = 0.20, random_state = 12)
        # initialize CountVectorizer
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)
        # fit and transform using training text
        docs_train_counts = self.movieVzer.fit_transform(docs_train)
        # Convert raw frequency counts into TF-IDF values
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)
        # Using the fitted vectorizer and transformer, transform the test data
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)
        # Now ready to build a classifier.
        # We will use Multinomial Naive Bayes as our model
        # Train a Multinomial Naive Bayes classifier. Again, we call it "fitting"
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)
        # save the model (only the classifier, not the vectorizer or transformer)
        filename = 'finalized_model.pkl'
        with open(filename, 'wb') as fout:
            pickle.dump(self.clf, fout)
        # Predict the Test set results, find accuracy
        y_pred = self.clf.predict(docs_test_tfidf)
        # Accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))
        self.Categorize()
    def Categorize(self):
        # very short and fake movie reviews
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good',
                       'This was certainly a movie', 'I fell asleep halfway through',
                       "We can't wait for the sequel!!", 'I cannot recommend this highly enough', 'What the hell is this shit?']
        reviews_new_counts = self.movieVzer.transform(reviews_new)  # turn text into count vector
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)  # turn into tfidf vector
        # have classifier make a prediction
        pred = self.clf.predict(reviews_new_tfidf)
        # print out results
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))
With MaximeKan's suggestion, I researched a way to save all three objects.
Saving the model and the vectorizers:
import pickle
with open(filename, 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)
Loading the model and the vectorizers for use:
import pickle
with open('finalized_model.pkl', 'rb') as f:
    movieVzer, movieTfmer, clf = pickle.load(f)
This is happening because you should save not only the classifier but also the vectorizers. Otherwise, you are retraining the vectorizers on unseen data, which will not contain exactly the same words as the training data, so the dimension of the feature vectors will change. This is an issue because your classifier expects input in the same format it was trained on.
Thus, the solution to your problem is quite simple: save your vectorizers as pickle files as well, and load them along with your classifier before using them.
Note: to avoid having several objects to save and load, you could consider putting them together in a pipeline, which is equivalent.
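The pipeline idea from the note can be sketched as follows. This is a minimal illustration with made-up toy reviews rather than the movie dataset from the question; the step names ("counts", "tfidf", "clf") are arbitrary labels chosen here.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the movie reviews (made up for illustration)
docs = ["great film", "loved this movie", "terrible plot", "boring and bad"]
labels = [1, 1, 0, 0]

pipe = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipe.fit(docs, labels)

# One pickled object carries the vocabulary, the IDF weights and the classifier
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly
new_reviews = ["great movie", "bad plot"]
same = list(restored.predict(new_reviews)) == list(pipe.predict(new_reviews))
```

Because the restored pipeline takes raw text, there is no separate vectorizer to keep in sync with the classifier.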

Keras Model Accuracy differs after loading the same saved model

I trained a Keras Sequential model and loaded the same model later. The two models give different accuracy.
I have come across a similar question but was not able to solve the problem.
Sample code:
Loading the data and training the model
model = gensim.models.FastText.load('abc.simple')
X,y = load_data()
Vectors = np.array(vectors(X))
X_train, X_test, y_train, y_test = train_test_split(Vectors, np.array(y),
                                                    test_size = 0.3, random_state = 0)
X_train = X_train.reshape(X_train.shape[0],100,max_tokens,1)
X_test = X_test.reshape(X_test.shape[0],100,max_tokens,1)
# data for input to our model
print(X_train.shape)
model2 = train()
score = model2.evaluate(X_test, y_test, verbose=0)
print(score)
Training Accuracy is 90%.
Saved the Model
# Saving Model
model_json = model2.to_json()
with open("model_architecture.json", "w") as json_file:
    json_file.write(model_json)
model2.save_weights("model_weights.h5")
print("Saved model to disk")
But after I restarted the kernel, loaded the saved model, and ran it on the same set of data, the accuracy dropped.
#load json and create model
json_file = open('model_architecture.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
#load weights into new model
loaded_model.load_weights("model_weights.h5")
print("Loaded model from disk")
# evaluate loaded model on test data
loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop',
metrics=['accuracy'])
score = loaded_model.evaluate(X_test, y_test, verbose=0)
print(score)
Accuracy dropped to 75% on the same set of data.
How can I make it consistent?
I have tried the following, but it did not help:
from keras.backend import manual_variable_initialization
manual_variable_initialization(True)
I even saved the whole model at once (weights and architecture), but that did not solve the issue either.
I'm not sure whether your issue has been solved, but for future readers:
I had exactly the same problem with saving and loading the weights. On loading the model, the accuracy and loss changed greatly, from 68% accuracy to 2%. In my experiment, I am using Tensorflow as the backend, with the Keras model layers Embedding, LSTM and Dense. My issue was solved by fixing the seed for Keras, which uses the NumPy random generator, and since I am using Tensorflow as the backend, I also fixed the seed for it.
These are the lines I added at the top of the file where the model is defined.
from numpy.random import seed
seed(42)# keras seed fixing
import tensorflow as tf
tf.random.set_seed(42)# tensorflow seed fixing
I hope this helps.
For more information have a look at this- https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
I had the same problem due to a silly mistake of mine - after loading the model I had in my data generator the shuffle option (useful for the training) turned to True instead of False. After changing it to False the model predicted as expected. It would be nice if keras could take care of this automatically. This is my critical code part:
pred_generator = pred_datagen.flow_from_directory(
    directory='./ims_dir',
    target_size=(100, 100),
    color_mode="rgb",
    batch_size=1,
    class_mode="categorical",
    shuffle=False,
)
model = load_model(logpath_ms)
pred=model.predict_generator(pred_generator, steps = N, verbose=1)
My code worked when I scaled my dataset before reevaluating the model. I did this treatment before saving the model and had forgotten to repeat this procedure when I opened the model and wanted to evaluate it again. After I did that, the accuracy value appeared as it should \o/
model_saved = keras.models.load_model('tuned_cnn_1D_HAR_example.h5')
trainX, trainy, testX, testy = load_dataset()
trainX, testX = scale_data(trainX, testX, True)
score = model_saved.evaluate(testX, testy, verbose=0)
print("%s: %.2f%%" % (model_saved.metrics_names[1], score[1]*100))
Inside my function scale_data I used StandardScaler().
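One way to make sure the evaluation data gets the same scaling as at training time is to store the fitted scaler next to the model and reload it later. The sketch below is an assumption about how that could look, not the code behind scale_data: the data is synthetic, and the file name "scaler.joblib" is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_x = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test_x = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# Training time: fit the scaler on the training data and store it next to the model
scaler = StandardScaler().fit(train_x)
scaler_path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(scaler, scaler_path)

# Evaluation time (e.g. after a restart): load the scaler and apply the same transform
loaded_scaler = joblib.load(scaler_path)
test_scaled = loaded_scaler.transform(test_x)

matches = np.allclose(test_scaled, scaler.transform(test_x))
```

Reusing the stored scaler avoids both forgetting the step and accidentally refitting it on different data.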
