Predicting new description with existing logistic regression classification model - scikit-learn

I am doing everything in a single Jupyter notebook.
I am trying to predict the category of a new store description using a logistic regression classifier and a count vectorizer.
All of the code below is shown in the order it appears in the notebook, whether I ran it or not.
Below is my code:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(stop_words='english', ngram_range=(1,1))
X_train_cv=cv.fit_transform(X_train.values.astype('str'))
X_test_cv=cv.transform(X_test.values.astype('str'))
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression(solver='lbfgs')
lr.fit(X_train_cv,y_train)
y_pred_cv=lr.predict(X_test_cv)
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred_cv,target_names=['electronics','fashion','F&B','services']))
# I never ran the code below, because I am not working across two notebooks
import os
import pickle
from datetime import datetime
model_path=['drive','mydrive','I125','models']
time=datetime.now().strftime("%Y-%m-%d")
filename='lr-{}.pkl'.format(time)
templist=[]
templist.append(filename)
path1=os.sep.join(model_path+templist)
filename='countvectorizer-{}.pkl'.format(time)
templist=[]
templist.append(filename)
path2=os.sep.join(model_path+templist)
with open(path1,'wb') as f1:
    pickle.dump(lr,f1)
with open(path2,'wb') as f2:
    pickle.dump(cv,f2)
I am trying to predict the category of a new description using the classifier I already have. I only know how to do this when the classifier is loaded in a separate notebook.
This is the code I have for predicting a new description:
# I never ran the code below, because I am not working across two notebooks
import os
import pickle
from google.colab import drive
drive.mount('/content/drive')
model_path=['drive','mydrive','I125','models']
# file names must match the ones used when saving (".pkl", "countvectorizer")
filename=['lr-2022-10-10.pkl']
path1=os.sep.join(model_path+filename)
filename=['countvectorizer-2022-10-10.pkl']
path2=os.sep.join(model_path+filename)
with open(path2,'rb') as f:
    trained_cv=pickle.load(f)
with open(path1,'rb') as f:
    model=pickle.load(f)
# I ran the code below
import re
import string
def preprocess(text):
    # raw string so that \w and \d reach the regex engine unchanged
    pattern_alphanumeric=r"\w*\d\w*"
    pattern_punctuation="["+re.escape(string.punctuation)+"]"
    text=re.sub(pattern_alphanumeric,'',text)
    text=re.sub(pattern_punctuation,'',text).lower()
    return text
new_text="This clothes so nice"
new_text_processed=preprocess(new_text)
def encode_text_to_vector(cv,text):
    text_vector = cv.transform([text])
    return text_vector
new_text_vector=encode_text_to_vector(trained_cv,new_text_processed)  # <-- line with error
print(new_text_vector)
Error:
trained_cv is undefined. (trained_cv and model are supposed to be the saved count vectorizer and logistic regression model that I would load if I were using a separate Jupyter notebook.)
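Since everything runs in one notebook, the fitted cv and lr objects from the training cells should still be in memory, so the pickle round trip is probably not needed: the already fitted vectorizer and classifier can be used directly. Below is a minimal sketch of that idea. The training texts and labels are made up for illustration, and predict_category is a hypothetical helper name, not something from the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (made up for illustration; not the asker's dataset).
train_texts = [
    "smartphone laptop charger",
    "laptop tablet headphones",
    "dress shirt clothes",
    "clothes shoes jacket",
]
train_labels = ["electronics", "electronics", "fashion", "fashion"]

# Same steps as the training cells in the question.
cv = CountVectorizer(stop_words="english", ngram_range=(1, 1))
X_train_cv = cv.fit_transform(train_texts)
lr = LogisticRegression(solver="lbfgs")
lr.fit(X_train_cv, train_labels)


def predict_category(text, vectorizer, classifier):
    """Vectorize one description with the fitted vectorizer and classify it."""
    # transform only: calling fit_transform here would rebuild the vocabulary
    text_vector = vectorizer.transform([text])
    return classifier.predict(text_vector)[0]


# In a single notebook the fitted objects are already available,
# so they can simply be bound to the names the prediction code expects.
trained_cv, model = cv, lr
prediction = predict_category("this clothes so nice", trained_cv, model)
print(prediction)
```

The important detail is that the new text goes through transform on the same vectorizer that was fitted on the training data, so the feature columns line up with what the classifier was trained on.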

Related

the right way to make prediction using Spacy word vectors

I'm learning how to convert text into numbers for NLP problems. Following a course, I'm learning about the word vectors provided by the spaCy package. The code works fine for training and evaluation, but I have two problems:
Making predictions for new sentences: I can't seem to make it work, and most examples just fit the model and then use the X_test set for evaluation (code below).
The person explaining stated that it's bad (won't give good results) if I use
doc.vector over doc.vector.values
When I try both I don't see a difference. What is the difference between the two?
The example classifies news titles as fake or real.
import spacy
import numpy as np  # needed for np.stack below
import pandas as pd
df= pd.read_csv('Fake_Real_Data.csv')
print(df.head())
print(f"shape is: {df.shape}")
print("checking the imbalance: \n ", df.label.value_counts())
df['label_No'] = df['label'].map({'Fake': 0, 'Real': 1})
print(df.head())
nlp= spacy.load('en_core_web_lg') # only large and medium model have word vectors
df['Text_vector'] = df['Text'].apply(lambda x: nlp(x).vector) #apply the function to EACH element in the column
print(df.head(5))
from sklearn.model_selection import train_test_split
X_train, X_test, y_train,y_test= train_test_split(df.Text_vector.values, df.label_No, test_size=0.2, random_state=2022)
x_train_2D= np.stack(X_train)
x_test_2D= np.stack(X_test)
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB()
from sklearn.preprocessing import MinMaxScaler
scaler= MinMaxScaler()
scaled_train_2d= scaler.fit_transform(x_train_2D)
scaled_test_2d= scaler.transform(x_test_2D)
clf.fit(scaled_train_2d, y_train)
from sklearn.metrics import classification_report
y_pred=clf.predict(scaled_test_2d)
print(classification_report(y_test, y_pred))
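For a new sentence, the same three steps used on the training data apply in the same order: embed the text, reshape the single vector into a 2-D array with one row, scale it with the already fitted scaler (transform, not fit_transform), then call predict. The sketch below illustrates this under some assumptions: toy_embed is a hypothetical stand-in for `lambda s: nlp(s).vector` so that the example runs without downloading en_core_web_lg, predict_sentence is a made-up helper name, and the four training titles are invented.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler


def toy_embed(text):
    """Hypothetical stand-in for nlp(text).vector: returns a 1-D float vector."""
    lowered = text.lower()
    return np.array(
        [lowered.count("shocking"), lowered.count("report"), len(lowered.split())],
        dtype=float,
    )


def predict_sentence(text, embed, scaler, clf):
    """Embed one sentence, scale it with the fitted scaler, and classify it."""
    vector_2d = embed(text).reshape(1, -1)  # one sample -> shape (1, n_features)
    scaled = scaler.transform(vector_2d)    # reuse training min/max, do not re-fit
    return clf.predict(scaled)[0]


# Invented training titles: 0 = Fake, 1 = Real (same mapping as label_No).
train_texts = [
    "Shocking shocking secret revealed",
    "Shocking miracle cure",
    "Quarterly report published today",
    "Annual report and budget report",
]
y_train = np.array([0, 0, 1, 1])

x_train_2D = np.stack([toy_embed(t) for t in train_texts])
scaler = MinMaxScaler()
scaled_train_2d = scaler.fit_transform(x_train_2D)
clf = MultinomialNB()
clf.fit(scaled_train_2d, y_train)

new_label = predict_sentence("Government report released", toy_embed, scaler, clf)
print(new_label)
```

With spaCy loaded, the call would presumably become predict_sentence(text, lambda s: nlp(s).vector, scaler, clf), using the scaler and clf fitted in the question's code.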

Do I have to run tsne.fit_transform for each set of embeddings that I want to visualize?

I'm trying to use sklearn.manifold.TSNE to visualize data that I sample from a generative model and compare the distribution of generated data vs training data (to measure 'extrapolation').
Here's how I'm doing it:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import joblib
import numpy as np
import pandas as pd
tsne = TSNE(n_components=2, random_state=0)
x_train = tsne.fit_transform(embds_train)
x_generated = tsne.fit_transform(embds_generated)
My question is: is it necessary to call tsne.fit_transform() separately on the training embeddings and the generated embeddings? Or could I fit only once and then add other embeddings to the already fitted space?
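One point that may help here: scikit-learn's TSNE only offers fit and fit_transform, with no separate transform method, so new points cannot be added to an already fitted embedding. Two separate fit_transform calls also produce two unrelated 2-D spaces. A common workaround is to stack both sets, run fit_transform once, and split the result. The sketch below assumes random arrays in place of embds_train and embds_generated, and the perplexity value is an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for the real embeddings (assumption for illustration).
rng = np.random.default_rng(0)
embds_train = rng.normal(size=(40, 16))
embds_generated = rng.normal(loc=0.5, size=(25, 16))

# Embed both sets together so they share one 2-D space.
combined = np.vstack([embds_train, embds_generated])
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
projected = tsne.fit_transform(combined)

# Split the joint projection back into the two groups for plotting.
n_train = len(embds_train)
x_train = projected[:n_train]
x_generated = projected[n_train:]
print(x_train.shape, x_generated.shape)
```

The two slices can then be drawn on the same axes with different colors to compare the distributions.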

Loading trained model to make prediction of single image

I have trained a ResNet50 model on the Intel image multiclass classification task. The task is to predict whether an image shows a building, a street, a glacier, etc. The model trained successfully and is able to make predictions. I have saved the model and am trying to use the saved model on a new image.
Here is the training code:
import os
import torch
import tarfile
import torchvision
import torch.nn as nn
from PIL import Image
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision import transforms
from torchvision.utils import make_grid
from torch.utils.data import random_split
from torchvision.transforms import ToTensor
from torchvision.datasets import ImageFolder
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets.utils import download_url
import PIL
import PIL.Image
import numpy as np
transform_train=transforms.Compose([
    transforms.Resize((150,150)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
transform_test=transforms.Compose([
    transforms.Resize((150,150)),
    transforms.ToTensor(),
    transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
...
torch.save(model2.state_dict(),'/content/drive/MyDrive/saved_model/model_resnet.pth')
When I load the model in another file, I use a similar image transformation, but it gives me an error. Here is the code and the error:
model = torch.load('/content/drive/MyDrive/saved_model/model_resnet.pth')
image=Image.open(Path('/content/drive/MyDrive/images/seg_pred/seg_pred/10004.jpg'))
transform_train=transforms.Compose([
    transforms.Resize((150,150)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((.5,.5,.5),(.5,.5,.5))
])
input = transform_train(image)
#input = input.view(1, 3, 150,150)
output = model(input)
prediction = int(torch.max(output.data, 1)[1].numpy())
print(prediction)
The error I get is
TypeError: 'collections.OrderedDict' object is not callable
My PyTorch version is
1.9.0+cu102
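torch.load on that file returns the saved state dict (an OrderedDict of weights), not a model, which is why it is not callable. You need to create the model structure first, the same way model2 is created in your training code. For example: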
model = resnet()
Then load the saved state dict:
model.load_state_dict(torch.load('/content/drive/MyDrive/saved_model/model_resnet.pth'))
model.eval()
Ref:
https://pytorch.org/tutorials/beginner/saving_loading_models.html
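Putting the steps above together, a minimal end-to-end sketch might look like the following. TinyClassifier is a hypothetical stand-in for however model2 is built in the training code, the class count of 6 is an arbitrary choice, a random tensor replaces the transformed image, and a temporary file replaces the Google Drive path. Note also that a single image tensor of shape (3, 150, 150) needs a batch dimension before it is passed to the model.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in for the ResNet50 created as model2 during training."""

    def __init__(self, num_classes=6):  # 6 is an arbitrary choice for this sketch
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 4, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(4, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))


# Training side: save only the weights, as in the question.
trained = TinyClassifier()
weights_path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(trained.state_dict(), weights_path)

# Inference side: torch.load returns a dict of tensors, not a model.
state_dict = torch.load(weights_path)
model = TinyClassifier()           # rebuild the same architecture first
model.load_state_dict(state_dict)  # then copy the saved weights into it
model.eval()                       # put dropout/batch-norm layers in eval mode

image_tensor = torch.rand(3, 150, 150)  # stand-in for transform_test(image)
batch = image_tensor.unsqueeze(0)       # add batch dimension -> (1, 3, 150, 150)
with torch.no_grad():
    output = model(batch)
prediction = int(output.argmax(dim=1).item())
print(prediction)
```

For inference it is probably better to use the deterministic transform_test pipeline rather than transform_train, since the random flips only make sense as training augmentation.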
Based on your question, it's clear that you want to make a prediction on a new image. But you are transforming the image with the training augmentations (random flips), which is not a proper way to get a prediction.
The code you linked contains plenty that you can reuse in your own code.
I am sharing fast.ai code and simple TensorFlow code with which you can predict a new image and see the result.
img = open_image('any_image.jpg')
print(learn.predict(img)[0])
OR you can try this function:
import matplotlib.pyplot as plt # visualization
import matplotlib.image as mpimg
import tensorflow as tf # Deep Learning Framework
import pathlib
def pred_plot(file, model, class_names=class_names, image_size=(150, 150)):
    # read and decode the image file, then resize it to the model's input size
    img = tf.io.read_file(file)
    img = tf.io.decode_image(img, channels=3)
    img = tf.image.resize(img, size=image_size)
    # add a batch dimension before predicting
    pred_probs = model.predict(tf.expand_dims(img, axis=0))
    pred_class = class_names[pred_probs.argmax()]
    plt.imshow(img/255.)  # scale pixel values to [0, 1] for display
    plt.title(f'Pred: {pred_class}')
    plt.axis(False);
Pass any image and you will get the prediction with a visualization.
url ='dummy.jpg'
pred_plot(url, model=model_2, class_names=class_names)

Unsatisfactory output from Tf-Idf

I have a document in a text file, split across lines as shown below. I wanted to apply tf-idf to it, but I get the error shown below. I am not sure where there is an int object in my file. Why would it throw this error?
Env:
Jupyter notebook, Python 3.7
Error:
AttributeError: 'int' object has no attribute 'lower'
file.txt:
Random person from the random hill came to a running mill and I have a count of the hill. This is my house.
A person is from a great hill and he loves to run a mill.
Sub-disciplines of biology are defined by the research methods employed and the kind of system studied: theoretical biology uses mathematical methods to formulate quantitative models while experimental biology performs empirical experiments.
The objects of our research will be the different forms and manifestations of life, the conditions and laws under which these phenomena occur, and the causes through which they have been effected. The science that concerns itself with these objects we will indicate by the name biology.
Code:
import pandas as pd
import spacy
import csv
import collections
import sys
import itertools
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
from nltk.tokenize import sent_tokenize
from gensim import corpora, models
from stop_words import get_stop_words
from nltk.stem import PorterStemmer
data = pd.read_csv('file.txt', sep="\n", header=None)
data.dtypes
0 object
dtype: object
data.shape
(4, 1)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
print(X)
I solved it by reading the file like this:
with open('file.txt') as f:
    lines = [line.rstrip() for line in f]
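The likely reason this fixes the error: iterating over a DataFrame yields its column labels rather than its rows, so fit_transform(data) received the integer label 0 and tried to lowercase it. Passing a list of strings, or the column itself, gives the vectorizer actual text. A small sketch of the difference, using two of the sentences from file.txt in place of reading the file:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

lines = [
    "Random person from the random hill came to a running mill.",
    "A person is from a great hill and he loves to run a mill.",
]

# A one-column DataFrame like the one read_csv(header=None) produces.
data = pd.DataFrame(lines)
column_labels = list(data)  # iterating a DataFrame gives column labels
print(column_labels)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(lines)        # a list of strings works
X_from_column = TfidfVectorizer(stop_words="english").fit_transform(data[0])
print(X.shape, X_from_column.shape)
```

So with the original DataFrame, vectorizer.fit_transform(data[0]) should also work, without changing how the file is read.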

Getting error while trying to save and apply existing machine learning model to new dataset?

I am trying to use this model: https://github.com/aninda052/Disasters-on-social-media-NLP/blob/master/Disasters%20on%20social%20media.ipynb
I searched for a way to save this model and use it with a new dataset in another application, and I found out that I can use pickle. I added it to the code like this:
import pickle
model_tfidf=LogisticRegression( C=30.0,class_weight='balanced', solver='newton-cg',
    multi_class='multinomial', n_jobs=-1, random_state=5)
model_tfidf.fit(x_train_tfidf, y_train)
predicted_tfidf=model_tfidf.predict(x_test_tfidf)
Pkl_Filename = "Pickle_RL_Model.pkl"
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(model_tfidf, file)
After that I created a new project to load and use this model. The code is:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import pickle
with open('Pickle_RL_Model.pkl', 'rb') as file:
    Pickled_LR_Model = pickle.load(file)
x=["hi disaster","flood disaster","cry sad bad ","srong storm"]
tfd=TfidfVectorizer()
new_data_vec=tfd.fit_transform(x)
Ypredict = Pickled_LR_Model.predict(new_data_vec)
But I got an error saying:
X has 8 features per sample; expecting 16988
I don't know what I did wrong. Any help, please?
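The mismatch most likely comes from fitting a brand-new TfidfVectorizer on the four new strings: its vocabulary has 8 terms, while the model was trained on the 16988 features of the original vectorizer. One way around it is to pickle the fitted vectorizer together with the model and call transform (not fit_transform) on new data. The sketch below uses a made-up training set, default LogisticRegression settings apart from class_weight and random_state, and an in-memory pickle instead of a file; the dictionary keys are arbitrary names.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data standing in for the disaster tweets.
train_texts = [
    "flood disaster destroys homes",
    "strong storm causes disaster",
    "lovely sunny day at the park",
    "great coffee with friends today",
]
y_train = [1, 1, 0, 0]

tfidf = TfidfVectorizer()
x_train_tfidf = tfidf.fit_transform(train_texts)
model_tfidf = LogisticRegression(class_weight="balanced", random_state=5)
model_tfidf.fit(x_train_tfidf, y_train)

# Save the fitted vectorizer AND the model together
# (pickle.dump to a file works the same way as pickle.dumps here).
payload = pickle.dumps({"vectorizer": tfidf, "model": model_tfidf})

# --- in the new project ---
bundle = pickle.loads(payload)
x = ["hi disaster", "flood disaster", "cry sad bad", "strong storm"]
new_data_vec = bundle["vectorizer"].transform(x)  # reuse the training vocabulary
y_predict = bundle["model"].predict(new_data_vec)
print(new_data_vec.shape, y_predict)
```

Words the training vocabulary has never seen are simply ignored by transform, so the number of columns always matches what the model expects.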
