I am new to machine learning. I am using SGDClassifier to classify my documents. I trained the model and used pickle to persist it.
Code in classify.py for training the model:
import pickle
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

corpus = df2.title_desc  # df2 is my dataframe with two columns: title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(corpus).todense()
variables = tfidf_matrix
labels = df2.category
variables_train, variables_test, labels_train, labels_test = train_test_split(variables, labels, test_size=0.1)
svm_classifier = linear_model.SGDClassifier(loss='hinge', alpha=0.0001)
svm_classifier = svm_classifier.fit(variables_train, labels_train)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(svm_classifier, fid)
After the model is dumped to a file, I created another .py file to test it.
test.py
import pickle
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer

corpus_test = df_test.title_desc  # df_test is my dataframe with two columns: title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_test = vectorizer.fit_transform(corpus_test).todense()
svm_classifier = linear_model.SGDClassifier(loss='hinge', alpha=0.0001)
with open('my_dumped_classifier.pkl', 'rb') as fid:
    svm_classifier = pickle.load(fid)
tfidf_matrix_test = vectorizer.transform(corpus_test).todense()
svm_predictions = svm_classifier.predict(tfidf_matrix_test)
I am not sure about the logic I have given in test.py. On the line
svm_predictions = svm_classifier.predict(tfidf_matrix_test)
I get the error 'ValueError: X has 249 features per sample; expecting 1050'.
Please suggest a solution.
I have a TensorFlow regression model that I have been working with. The model is well tuned and gets good results while training. However, when I go to evaluate it, the results are horrible. I did some research and found that I am not normalizing my test features and labels, so I suspect that is where the problem is. My thought is to normalize the whole dataset before splitting it into train and test sets, but I am getting an AttributeError that has me stumped.
Here is the code sample. Please help :)
import pandas as pd
import tensorflow as tf

# concatenate the surface data and single_downhole_col into a single dataframe
training_Data = pd.concat([surface_Data, single_downhole_col], axis=1)
#print('training data shape:', training_Data.shape)
#print(training_Data.head())

# normalize the data using keras
model_normalizer_layer = tf.keras.layers.Normalization(axis=-1)
model_normalizer_layer.adapt(training_Data)
normalized_training_Data = model_normalizer_layer(training_Data)

# convert the data frame to array
dataset = normalized_training_Data.copy()
dataset.tail()

# create a training and test set
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

# check the data
train_dataset.describe().transpose()

# split features from labels
train_features = train_dataset.copy()
test_features = test_dataset.copy()
And if there is any interest in knowing how the normalizer layer is used in the model, please see below:
def build_and_compile_model(data):
    model = keras.Sequential([
        model_normalizer_layer,
        layers.Dense(260, input_dim=401, activation='relu'),
        layers.Dense(80, activation='relu'),
        #layers.Dense(40, activation='relu'),
        layers.Dense(1)
    ])
I found that quasimodo's suggestion of normalizing the dataset before processing it in my model was the ideal solution. It scaled the data from 0 to 1 for all columns as expected and allowed me to display the data prior to training to validate that it was correct.
For whatever reason, keras.layers.Normalization was not working in my case.
from sklearn import preprocessing

# scale every column to the 0-1 range before handing the data to keras
x = training_Data.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
training_Data = pd.DataFrame(x_scaled)

# normalize the data using keras
model_normalizer_layer = tf.keras.layers.Normalization(axis=-1)
model_normalizer_layer.adapt(training_Data)
normalized_training_Data = model_normalizer_layer(training_Data)
The only part that I have yet to figure out is how to scale the predicted data from the model back to the original ranges of the column. I'm sure it's simple, but I'm stumped.
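For the rescaling question, one option is a minimal sketch along these lines, assuming a second MinMaxScaler is fitted on just the label column (label_col is a placeholder column name, not from the original code):

from sklearn import preprocessing

# fit a dedicated scaler on the label column only (label_col is a placeholder name)
label_scaler = preprocessing.MinMaxScaler()
y_scaled = label_scaler.fit_transform(training_Data[[label_col]])

# ... train the model on the scaled features and y_scaled ...

# predictions come back in the 0-1 range; map them back to the original units
predictions_scaled = model.predict(test_features)
predictions_original = label_scaler.inverse_transform(predictions_scaled.reshape(-1, 1))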
I created a multi-class classification model with Linear SVM, but I am not able to classify a newly loaded dataframe (the data that must be classified); I get the following error.
What should I do to convert my new text (df.reason_text) to TF-IDF and classify it (call model.predict(...)) with my model?
Training Model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words=stopwords)
features = tfidf.fit_transform(training.Description).toarray()
labels = training.category_id
model = LinearSVC()
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(features, labels, training.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Now I'm not able to convert my new dataframe so that it can be classified.
Load New DataFrame for Classification
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://athenaxxxxxxxx/result/',
               region_name='us-east-2')
df = pd.read_sql("select * from data.classification_text_reason", conn)
features2 = tfidf.fit_transform(df.reason_text).toarray()
features2.shape
After I convert the new dataframe's text with TF-IDF and try to classify it, I get the following error:
y_pred1 = model.predict(features2)
ValueError: X has 1272 features per sample; expecting 5319
When you are loading a new DF for classification, you are calling fit_transform() again, but you should be calling only transform().
fit_transform() description: Learn vocabulary and idf, return term-document matrix.
transform() description: Transform documents to document-term matrix.
You need to use the transformer created when training the algorithm, so the code would be:
tfidf.transform(df.reason_text).toarray()
If you still have the feature shape error, there may be a problem with the shapes of the arrays. Solve the transform part, and if the error still occurs, post an example of the train and test data in array format; I will keep helping.
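For illustration, a minimal sketch of persisting the fitted vectorizer together with the model and reusing both on the new dataframe (the file name is an assumption, not from the original post):

import pickle

# after training: save the fitted vectorizer and the trained model together
with open('tfidf_and_model.pkl', 'wb') as fout:
    pickle.dump((tfidf, model), fout)

# later, in the classification script: load both and only transform() the new text
with open('tfidf_and_model.pkl', 'rb') as fin:
    tfidf, model = pickle.load(fin)

features2 = tfidf.transform(df.reason_text).toarray()
y_pred1 = model.predict(features2)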
I'm trying to make a text classification model with sklearn. I'm quite new to Python and also to sklearn. I already built the model with some training data and saved it, but there's an error when I try to reuse the model in another Python program/file.
I already looked at some similar problems here on Stack Overflow, but I couldn't find a solution for my case.
I added some comments, so you can read the code more easily.
...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
                   errors='ignore').read()

# separate labels from text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using the count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
And since I was training with different methods to evaluate which was best, I made a train_model method.
...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
                correct_model=False):
    # fit the training dataset on the classifier
    ...
    elif correct_model:
        classifier.fit(feature_vector_train, label)
        pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
        with open(pkl_filename, 'wb') as file:
            pickle.dump(classifier, file)
        # with open(pkl_filename, 'rb') as file:
        #     pickle_model = pickle.load(file)
        # joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # result = loaded_model.score(feat)
        # print(pickle_model.predict(feature_vector_valid))
    ...
    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)
    ...
    return metrics.accuracy_score(valid_y, predictions)
...
This is the "correct_model":
...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...
This model gives me something around 80% accuracy on the validation data.
So this is my test file, where I wanted to check whether I can load and reuse the model:
...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using the count vectorizer object
test_data = count_vect.transform(trainDF['text'])

# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

# reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
Then I get this error:
...
ValueError: X has 7 features per sample; expecting 18282
What I expected is that it would give me "3" as a result, which is the encoding for a specific label. These predictions work in the same file where I also train the model, but somehow I cannot use the model on new validation data.
I think I made some mistake when fitting and/or transforming the data.
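For reference, a minimal sketch of the usual fix, persisting the fitted CountVectorizer from the training script alongside the classifier and calling only transform() in the test script (the file name is an example, not from the original code):

import pickle

# in the training script: save the fitted vectorizer and the classifier together
with open('vectorizer_and_model.pkl', 'wb') as fout:
    pickle.dump((count_vect, classifier), fout)

# in the test script: load both, and only transform the new text
with open('vectorizer_and_model.pkl', 'rb') as fin:
    count_vect, pickle_model = pickle.load(fin)

test_data = count_vect.transform(["Der Bus hat nicht an der Haltestelle gehalten"])
print(pickle_model.predict(test_data))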
Below is some code for a classifier. I used pickle to save and load the classifier as instructed on this page. However, when I load it to use it later, I cannot use the CountVectorizer() and TfidfTransformer() to convert raw text into vectors that the classifier can use.
The only way I was able to get it to work is to analyze the text immediately after training the classifier, as seen below.
import os
import sklearn
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
import nltk
import pandas
import pickle


class Classifier:
    def __init__(self):
        self.moviedir = os.getcwd() + '/txt_sentoken'

    def Training(self):
        # loading all files
        self.movie = load_files(self.moviedir, shuffle=True)

        # split data into training and test sets
        docs_train, docs_test, y_train, y_test = train_test_split(self.movie.data, self.movie.target,
                                                                  test_size=0.20, random_state=12)

        # initialize CountVectorizer
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)

        # fit and transform using the training text
        docs_train_counts = self.movieVzer.fit_transform(docs_train)

        # convert raw frequency counts into TF-IDF values
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)

        # using the fitted vectorizer and transformer, transform the test data
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)

        # Now ready to build a classifier.
        # We will use Multinomial Naive Bayes as our model.
        # Train a Multinomial Naive Bayes classifier. Again, we call it "fitting".
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)

        # save the model
        filename = 'finalized_model.pkl'
        pickle.dump(self.clf, open(filename, 'wb'))

        # predict the test set results, find accuracy
        y_pred = self.clf.predict(docs_test_tfidf)

        # accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))

        self.Categorize()

    def Categorize(self):
        # very short and fake movie reviews
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good',
                       'This was certainly a movie', 'I fell asleep halfway through',
                       "We can't wait for the sequel!!", 'I cannot recommend this highly enough',
                       'What the hell is this shit?']
        reviews_new_counts = self.movieVzer.transform(reviews_new)  # turn text into count vectors
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)  # turn into tfidf vectors

        # have the classifier make a prediction
        pred = self.clf.predict(reviews_new_tfidf)

        # print out results
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))
With MaximeKan's suggestion, I researched a way to save all 3.
saving the model and the vectorizers
import pickle
with open(filename, 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)
loading the model and the vectorizers for use
import pickle
with open('finalized_model.pkl', 'rb') as f:
    movieVzer, movieTfmer, clf = pickle.load(f)
This is happening because you should save not only the classifier but also the vectorizers. Otherwise you are retraining the vectorizers on unseen data, which obviously will not contain exactly the same words as the training data, so the dimensions will change. This is an issue because your classifier expects a certain input format.
Thus, the solution to your problem is quite simple: you should also save your vectorizers as pickle files and load them along with your classifier before using them.
Note: to avoid having multiple objects to save and load, you could consider putting them together in a pipeline, which is equivalent.
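As an illustration of the note above, a minimal sketch of the pipeline variant, reusing the component names from the question (the exact vectorizer settings and file name here are assumptions):

import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# bundle the vectorizer, transformer, and classifier into a single object
text_clf = Pipeline([
    ('vect', CountVectorizer(min_df=2, max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(docs_train, y_train)  # docs_train / y_train as in the training code above

# one pickle file now covers the whole preprocessing + model chain
with open('finalized_pipeline.pkl', 'wb') as fout:
    pickle.dump(text_clf, fout)

with open('finalized_pipeline.pkl', 'rb') as fin:
    text_clf = pickle.load(fin)
print(text_clf.predict(['This movie was excellent']))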
I am wondering whether it is possible to train Spark word2vec in batch mode, or, in other words, whether it is possible to update the vocabulary of a Spark word2vec model that has already been trained.
My application is this: my paragraphs are located in multiple files, and when I use gensim I can do
from gensim.models import Word2Vec

class MySentences(object):
    def __init__(self, file_list, folder):
        self.file_list = file_list
        self.folder = folder

    def __iter__(self):
        for file in self.file_list:
            if 'walk_' in file:
                print(file)
                with open(self.folder + file, 'r') as f:
                    for line in f:
                        yield line.split()

model = Word2Vec(MySentences(files, fileFolder), size=32, window=5, min_count=5, workers=15)
I can even do
for epoch in range(10):
    model.train(MySentences(files, fileFolder))
I am wondering how I can do similar things with Spark word2vec. In Spark, I found that I can only do an RDD union of multiple files, as in:
from pyspark.mllib.feature import Word2Vec
from pyspark.sql import SQLContext
inp1 = sc.textFile("file1").map(lambda row: row.split('\t'))
inp2 = sc.textFile("file2").map(lambda row: row.split('\t'))
inp = sc.union([inp1,inp2])
word2vec = Word2Vec().setVectorSize(4).setMinCount(1)
model = word2vec.fit(inp)
Otherwise, if I train the model with inp1 and then inp2, the words from inp1 will be gone.
If I cannot do the training in batch mode, how can I update a trained model with new paragraphs in the future?
I think you can:
for idx in range(1, 100, 1):
    model = word2vec.fit(data.sample(False, 0.01))
    model.save(sc, path)
I am not sure whether the sample function always takes unseen data in this example.
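As a small follow-up sketch, the saved model can later be loaded back for lookups, assuming the same sc and path as above ("some_word" is just a placeholder); note that this only restores what was already trained, it does not add new vocabulary:

from pyspark.mllib.feature import Word2VecModel

# load the previously saved model (same SparkContext and path as in the loop above)
model = Word2VecModel.load(sc, path)

# look up the learned vector for a word and its nearest neighbours
vec = model.transform("some_word")
synonyms = model.findSynonyms("some_word", 5)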