I'm trying to build a text classification model with sklearn. I'm quite new to Python and also to sklearn. I already trained the model on some training data and saved it, but I get an error when I try to reuse the model in another Python program/file.
I already looked at some similar problems here on Stack Overflow, but I couldn't find a solution that works for me.
I added some comments so you can read the code more easily.
...
# load the dataset
data = codecs.open('C:/Users/baran/PycharmProjects/test/resource/CorpusMitLabelsPlusSonstige.txt', encoding='utf8',
                   errors='ignore').read()
# separate labels from the text
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
...
And since I was training with different methods to evaluate which one was better, I wrote a train_model function.
...
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False, is_not_tfid=False,
                correct_model=False):
    # fit the training dataset on the classifier
    ...
    elif correct_model:
        classifier.fit(feature_vector_train, label)
        pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
        with open(pkl_filename, 'wb') as file:
            pickle.dump(classifier, file)
        # with open(pkl_filename, 'rb') as file:
        #     pickle_model = pickle.load(file)
        # joblib.dump(classifier, "C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # loaded_model = joblib.load("C:/Users/baran/PycharmProjects/test/resources/model.pkl")
        # result = loaded_model.score(feat)
        # print(pickle_model.predict(feature_vector_valid))
    ...
    # predict the labels on validation dataset
    predictions = classifier.predict(feature_vector_valid)
    ...
    return metrics.accuracy_score(valid_y, predictions)
...
This is the "correct_model":
...
# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count, correct_model=True)
print("LR, Count Vectors: ", accuracy)
...
This model gives me something around 80% accuracy on the validation data.
So this is my test file, where I wanted to check whether I can load and reuse the model:
...
texts = []
texts.append("Der Bus hat nicht an der Haltestelle gehalten")
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
# transform the test data using the count vectorizer object
test_data = count_vect.transform(trainDF['text'])
# load the model
pkl_filename = "C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
# reuse the model
test_load = joblib.load("C:/Users/baran/PycharmProjects/test/model.pkl")
print(test_load.predict(test_data))
...
Then I get this error:
...
ValueError: X has 7 features per sample; expecting 18282
What I expected is that it would give me "3" as a result, which is the encoding for a specific label. These predictions work in the same file where I also train the model, but somehow I cannot use new validation data.
I think I made some mistake when fitting and/or transforming the data.
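For reference, here is a rough sketch of what I suspect the reuse should look like (just my guess, not verified; the vectorizer.pkl filename is made up for illustration): the fitted CountVectorizer from training would have to be saved and loaded too, instead of fitting a new one on the single test sentence.
# in the training file: dump the fitted vectorizer alongside the classifier
with open("C:/Users/baran/PycharmProjects/test/resources/vectorizer.pkl", 'wb') as file:
    pickle.dump(count_vect, file)
# in the test file: load both objects and only call transform(), never fit()
with open("C:/Users/baran/PycharmProjects/test/resources/vectorizer.pkl", 'rb') as file:
    count_vect = pickle.load(file)
with open("C:/Users/baran/PycharmProjects/test/resources/pickle_model.pkl", 'rb') as file:
    pickle_model = pickle.load(file)
test_data = count_vect.transform(["Der Bus hat nicht an der Haltestelle gehalten"])
print(pickle_model.predict(test_data))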
Related
I have been trying to convert the pretrained slowfast_r50 model to TorchScript, but I am getting the following error. Could anyone help me out with this? Is it possible to convert an existing pretrained pytorchvideo model to TorchScript or ONNX format? Thanks.
import torch
import json
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo,
)
from pytorchvideo.data.encoded_video import EncodedVideo
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample,
    UniformCropVideo
)
with open("kinetics_classnames.json", "r") as f:
    kinetics_classnames = json.load(f)
# Create an id to label name mapping
kinetics_id_to_classname = {}
for k, v in kinetics_classnames.items():
    kinetics_id_to_classname[v] = str(k).replace('"', "")
# Device on which to run the model
# Set to cuda to load on GPU
device = "cpu"
# Pick a pretrained model
model_name = "slowfast_r50"
model = torch.hub.load("facebookresearch/pytorchvideo:main", model=model_name, pretrained=True)
# Set to eval mode and move to desired device
model = model.to(device)
model = model.eval()
# The duration of the input clip is also specific to the model.
clip_duration = (num_frames * sampling_rate) / frames_per_second
# Load the example video
video_path = "demo.mp4"
# Select the duration of the clip to load by specifying the start and end duration
# The start_sec should correspond to where the action occurs in the video
start_sec = 0
end_sec = start_sec + clip_duration
# Initialize an EncodedVideo helper class
video = EncodedVideo.from_path(video_path)
# Load the desired clip
video_data = video.get_clip(start_sec=start_sec, end_sec=end_sec)
# Apply a transform to normalize the video input
video_data = transform(video_data)
# Move the inputs to the desired device
inputs = video_data["video"]
inputs = [i.to(device)[None, ...] for i in inputs]
# Pass the input clip through the model
preds = model(inputs)
traced_script_module = torch.jit.trace(model, inputs)
# Save the TorchScript model
traced_script_module.save("traced_resnet_model.pt")
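One detail that may matter here (a sketch of an assumption, not a confirmed fix): torch.jit.trace turns a bare list of example inputs into separate positional arguments, while the SlowFast forward expects a single list of pathway tensors, so the example inputs probably need to be wrapped in a tuple.
# sketch (unverified): pass the two-pathway input list as a single argument
traced_script_module = torch.jit.trace(model, (inputs,), strict=False)
traced_script_module.save("traced_slowfast_r50.pt")
# if tracing still fails (e.g. data-dependent control flow inside the model),
# torch.jit.script(model) is the usual alternative to try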
I have questions regarding building a custom dataset and iterator using torchtext. I used the following code, found in this post, and modified it for my case:
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
text_field = Field(sequential=True, eos_token="[CLS]", tokenize=tokenizer)
label_field = Field(sequential=False, use_vocab=False)
data_fields = [("file", None),
("text", text_field),
("label", label_field)]
train, val = train_test_split(input_dt, test_size=0.1)
train.to_csv("train_output_path", index=False)
val.to_csv("val_output_path", index=False)
train, val = TabularDataset(path="path", train="train.csv", validation="val.csv",
format="csv", skip_header=True, fields=data_fields)
When it comes to text_field.build_vocab(train), I got this error: TypeError: '<' not supported between instances of 'list' and 'int'.
The only difference between my code and the post is the pre-trained word embeddings. In the post, the author used GloVe, while I use XLNetTokenizer from the transformers package. I also searched for other posts that used a similar method, but they all used pre-trained word embeddings and therefore did not have this issue.
Does anyone know how to fix this issue? Many thanks!
I think as you are using a predefined tokenizer you don't need to build a vocab; instead you can follow these steps. Here is an example of how to do it using the BERT tokenizer.
sentences: a list of the text data
labels: the labels associated with each sentence
###tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []
# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    # (1) Tokenize the sentence.
    # (2) Prepend the `[CLS]` token to the start.
    # (3) Append the `[SEP]` token to the end.
    # (4) Map tokens to their IDs.
    # (5) Pad or truncate the sentence to `max_length`
    # (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
        sent,                           # Sentence to encode.
        add_special_tokens = True,      # Add '[CLS]' and '[SEP]'
        max_length = 100,               # Pad & truncate all sentences.
        pad_to_max_length = True,
        return_attention_mask = True,   # Construct attn. masks.
        return_tensors = 'pt',          # Return pytorch tensors.
    )
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])
# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)
# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])
### Now combine the input ids, masks and labels, and divide the dataset
from torch.utils.data import TensorDataset, random_split
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)
# Create a 90-10 train-validation split.
# Calculate the number of samples to include in each set.
train_size = int(0.90 * len(dataset))
val_size = len(dataset) - train_size
# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))
### Now you create DataLoaders for these datasets
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
# The DataLoader needs to know our batch size for training, so we specify it
# here. For fine-tuning BERT on a specific task, the authors recommend a batch
# size of 16 or 32.
batch_size = 32
# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order.
train_dataloader = DataLoader(
    train_dataset,  # The training samples.
    sampler = RandomSampler(train_dataset),  # Select batches randomly
    batch_size = batch_size  # Trains with this batch size.
)
# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
    val_dataset,  # The validation samples.
    sampler = SequentialSampler(val_dataset),  # Pull out batches sequentially.
    batch_size = batch_size  # Evaluate with this batch size.
)
I am new to Keras and I have the following code for the model part:
# make inputs
self.input_samples = Input(shape=(self.input_shape, ))
self.input_labels = Input(shape=(self.nClass, ))
# Encoder for samples
self.E = self.encoder()(self.input_samples)
# Encoder for labels
self.E_LBLs = self.encoder4lbls()(self.input_labels)
# Decoder for reconstruction
self.D = self.decoder()(self.E)
# Task network
task_net = self.taskOut()
self.T = task_net(self.E)
self.T_LBLS = task_net(self.E_LBLs)
# define GAN for prior matching for samples and labels
self.A = self.adversarial() # This is the discriminator for latent code matching
print(type(self.E))
self.Adv = self.A(concatenate([self.E, self.E_LBLs], axis=0)) # logits for samples and labels
self.A.compile('Adam', loss='binary_crossentropy', metrics=['acc'])
# define MMD loss
# self.merge_embeds = concatenate([self.E, self.E_LBLs], axis=0, name='mmd')
model = Model([self.input_samples, self.input_labels], [self.E, self.E_LBLs, self.Adv])
When I want to output self.Adv using model.predict([inputs1, inputs2]), it seems the concat operation in concatenate([self.E, self.E_LBLs], axis=0) always goes wrong.
The error message is:
res_list = model.predict([trainSamples, trainLabels])
File "/DB/rhome/xchen/anaconda2/envs/Conda_python3_5/lib/python3.5/site-packages/keras/engine/training.py", line 1835, in predict
verbose=verbose, steps=steps)
File "/DB/rhome/xchen/anaconda2/envs/Conda_python3_5/lib/python3.5/site-packages/keras/engine/training.py", line 1339, in _predict_loop
outs[i][batch_start:batch_end] = batch_out
ValueError: could not broadcast input array from shape (64,1) into shape (32,1)
I am sure that self.E and self.E_LBLs are right, and their shapes are [N1 x 2000] and [N2 x 2000] respectively.
Do you have any idea? I cannot solve it.
Thanks.
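The only workaround I have sketched so far (an assumption on my side, not verified): give the discriminator one output per embedding instead of stacking along the batch axis, so every model output keeps the same batch size as the inputs and predict() does not have to stitch a doubled batch back onto N samples.
# sketch: two (N, 1) discriminator outputs instead of one (2N, 1) output
self.Adv_samples = self.A(self.E)
self.Adv_lbls = self.A(self.E_LBLs)
model = Model([self.input_samples, self.input_labels],
              [self.E, self.E_LBLs, self.Adv_samples, self.Adv_lbls])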
I have a dataframe on which I have built a predictive model. The data is divided into train and test, and I have used a RandomForest classifier.
Now the user passes new data, which needs to go through this model and return the result.
It is text data, and below is the dataframe:
Description          Category
Rejoin this domain   Network
Laptop crashed       Hardware
Installation Error   Software
Code:
############### Feature extraction ##############
countvec = CountVectorizer()
counts = countvec.fit_transform(read_data['Description'])
df = pd.DataFrame(counts.toarray())
df.columns = countvec.get_feature_names()
print(df)
########## Join with original data ##############
df = read_data.join(df)
a = list(df.columns.values)
########## Creating the dependent variable class for "Category" variable ###########
factor = pd.factorize(df['Category'])
df.Category = factor[0]
definitions = factor[1]
print(df.Category.head())
print(definitions)
########## Creating the dependent variable class for "Description" variable ###########
factor = pd.factorize(df['Description'])
df.Description = factor[0]
definitions_1 = factor[1]
print(df.Description.head())
print(definitions_1)
######### Split into Train and Test data #######################
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.80, random_state = 21)
############# Random forest classification model #########################
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
######### Predicting the Test set results ##############
y_pred = classifier.predict(X_test)
#####Reverse factorize (converting y_pred from 0s,1s and 2s to original class for "Category" ###############
reversefactor = dict(zip(range(3),definitions))
y_test = np.vectorize(reversefactor.get)(y_test)
y_pred = np.vectorize(reversefactor.get)(y_pred)
#####Reverse factorize (converting y_pred from 0s,1s and 2s to original class for "Description" ###############
reversefactor = dict(zip(range(53),definitions_1))
X_test = np.vectorize(reversefactor.get)(X_test)
If you only want to do prediction on the user's data, then I would simply load the new csv (or other format) containing the user's data (making sure the columns are the same as in the original training dataset, minus the dependent variable obviously) and you can pull the predictions for your task:
user_df = pd.read_csv("user_data.csv")
#insert a preprocessing step if needed to make sure user_df is identical to the original dataset
new_predictions = classifier.predict(user_df)
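Since the features in the question come from a CountVectorizer, that preprocessing step would presumably mean reusing the vectorizer fitted on the training data rather than fitting a new one. Roughly (a sketch, assuming countvec and classifier from the training code are still available and that X was built from the count-vector columns only):
# transform the user's text with the *already fitted* CountVectorizer so the
# feature columns line up with what the classifier was trained on
user_df = pd.read_csv("user_data.csv")
user_counts = countvec.transform(user_df['Description'])
user_features = pd.DataFrame(user_counts.toarray(), columns=countvec.get_feature_names())
new_predictions = classifier.predict(user_features)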
I am new to machine learning. I am using SGDClassifier to classify my documents. I trained the model, and to persist the trained model I used pickle.
Code in classify.py for training the model:
corpus=df2.title_desc #df2 is my dataframe with 2 columns title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix=vectorizer.fit_transform(corpus).todense()
variables = tfidf_matrix
labels = df2.category
variables_train, variables_test, labels_train, labels_test = train_test_split(variables, labels, test_size=0.1)
svm_classifier=linear_model.SGDClassifier(loss='hinge',alpha=0.0001)
svm_classifier=svm_classifier.fit(variables_train, labels_train)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(svm_classifier, fid)
After the data is dumped to a file, I created another .py file to test the model.
test.py
corpus_test=df_test.title_desc #df_test is my dataframe with 2 columns title_desc and category
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_test=vectorizer.fit_transform(corpus_test).todense()
svm_classifier=linear_model.SGDClassifier(loss='hinge',alpha=0.0001)
with open('my_dumped_classifier.pkl', 'rb') as fid:
    svm_classifier = pickle.load(fid)
tfidf_matrix_test=vectorizer.transform(corpus_test).todense()
svm_predictions=svm_classifier.predict(tfidf_matrix_test)
I am not sure about the logic I have written in test.py. On the line
svm_predictions=svm_classifier.predict(tfidf_matrix_test)
I get the error 'ValueError: X has 249 features per sample; expecting 1050'.
Please give a solution.
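For comparison, here is a minimal sketch of the pattern commonly recommended for this train-once / predict-later setup (it restructures the original code, so treat it as an assumption rather than a drop-in fix): fit the TfidfVectorizer and the SGDClassifier together as one Pipeline, pickle that single object in classify.py, and in test.py only load it and call predict, so the TF-IDF vocabulary is never re-fitted on the test corpus.
# classify.py: fit and persist vectorizer + classifier as one object
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('svm', linear_model.SGDClassifier(loss='hinge', alpha=0.0001)),
])
pipeline.fit(df2.title_desc, df2.category)
with open('my_dumped_classifier.pkl', 'wb') as fid:
    pickle.dump(pipeline, fid)
# test.py: load the fitted pipeline and predict on raw text
with open('my_dumped_classifier.pkl', 'rb') as fid:
    pipeline = pickle.load(fid)
svm_predictions = pipeline.predict(df_test.title_desc)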