How to get predictions for new data from MultinomialNB? - scikit-learn

I'm venturing into a new topic and experimenting with categorising product names. Even without deeper knowledge, a (superficial) use of MultinomialNB already yielded quite good results for my use case.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe', 'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe']
})
count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['title'])
bow = np.array(bow.todense())
X = bow
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = MultinomialNB().fit(X_train, y_train)
model.predict(X_test)
Based on the training in the simplified example above, I would like to categorise completely new titles and output them with their predicted labels:
new = pd.DataFrame({
    'title': ['long top', 'super shirt', 'white shoe', 'super cool sneaker', 'perfect fit ballerinas'],
    'label': np.nan
})
Unfortunately, I am not sure of the next steps and would hope for some support.
...
count_vec = CountVectorizer()
bow = count_vec.fit_transform(new['title'])
bow = np.array(bow.todense())
model.predict(bow)

It's a mistake to fit CountVectorizer on the whole dataset, because the test set should not be used at all during training. This discipline not only follows proper ML practice (it prevents data leakage), it also avoids a practical problem: when the test set is prepared together with the training set, it becomes confusing to apply the model to another test set.
The clean way to proceed is to always split the data into training and test sets first; this forces you to transform the test set independently of the training set, and it then becomes easy to apply the model to any other test set.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
df = pd.DataFrame({
    'title': ['short shirt', 'long shirt', 'green shoe', 'cool sneaker', 'heavy ballerinas'],
    'label': ['shirt', 'shirt', 'shoe', 'shoe', 'shoe']
})
X = df['title']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
# 1) Training: use only training set!
# labels should be encoded
le = preprocessing.LabelEncoder()
y_train_enc = le.fit_transform(y_train)
count_vec = CountVectorizer()
X_train_bow = count_vec.fit_transform(X_train)
X_train_bow = np.array(X_train_bow.todense())
model = MultinomialNB().fit(X_train_bow, y_train_enc)
# 2) Testing: apply previous transformation to test set before applying model
X_test_bow = count_vec.transform(X_test)
X_test_bow = np.array(X_test_bow.todense())
y_test_enc = model.predict(X_test_bow)
print("Predicted labels test set 1:",le.inverse_transform(y_test_enc))
# 3) apply to another dataset = just another step of testing, same as above
new = pd.DataFrame({
    'title': ['long top', 'super shirt', 'white shoe', 'super cool sneaker', 'perfect fit ballerinas'],
    'label': np.nan
})
X_new = new['title']
X_new_bow = count_vec.transform(X_new)
X_new_bow = np.array(X_new_bow.todense())
y_new_enc = model.predict(X_new_bow)
print("Predicted labels test set 2:", le.inverse_transform(y_new_enc))
Notes:
This point is not specific to MultinomialNB; it is the correct method for any classifier.
With real data it's often a good idea to use the min_df argument of CountVectorizer, because rare words increase the number of features without helping to predict the label, and they cause overfitting (see the sketch below).
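For instance, a minimal sketch of the min_df option (the threshold of 2 is a hypothetical value to be tuned on real data):
count_vec = CountVectorizer(min_df=2)  # ignore words seen in fewer than 2 training documents
X_train_bow = count_vec.fit_transform(X_train)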

Related

How to improve negative R square in DecisionTree Regressor

I was trying to apply a regressor to predict IMDB ratings. This is what I tried:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
data = pd.read_csv("D:/Code/imdb_project/movie_metadata.csv")
df = data[["duration","budget", "title_year","imdb_score"]]
df = df.dropna()
feature = np.array(df[["duration","budget","title_year"]])
rating = np.array(df["imdb_score"])
scaler = MinMaxScaler()
scaler.fit(feature)
X = scaler.transform(feature)
y = rating
x_train, x_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 5)
regressor = DecisionTreeRegressor(criterion='mse')
regressor.fit(x_train, y_train)
regressor.score(x_test, y_test)
For clarification, my dataset contains 3 features: budget, release year, and duration; y is the IMDB rating.
When applying this regressor to the test data, I always get a negative R square (it works just fine on the training data). I understand that R square can be negative, but I am still wondering whether there is a way I can improve it. The only way I know of is normalizing the data, which I did before fitting the model.
A negative R^2 score means your model fits the data very poorly. In this case the decision tree may be too simple, or the criterion may be a poor fit.
I would recommend tuning your model's hyperparameters or trying another model.
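As a sketch of what hyperparameter tuning could look like here (the grid below is an assumed starting point, not tuned values):
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
# Hypothetical search space; adjust the ranges to your data.
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_leaf': [1, 5, 20],
}
search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5, scoring='r2')
search.fit(x_train, y_train)
print(search.best_params_)
print(search.score(x_test, y_test))  # R^2 of the best model on the test set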

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model I am getting a perfect r^2 = 1.0. Does anyone know why this is the case/what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
                           'Pop_density_MOE',
                           'Age_median_MOE',
                           'Sex_ratio_MOE',
                           'Income_median_household_MOE',
                           'Pop_total_pov_status_determ_MOE',
                           'Pop_total_50percent_pov_MOE',
                           'Pop_total_125percent_pov_MOE',
                           'Poverty_percent_below_MOE',
                           'Total_labourforceMOE',
                           'Unemployed_total_MOE',
                           'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0, how='any', inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So it turns out it was a stupid mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
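For reference, the fix amounts to dropping the DV from the feature matrix as well (using the column names from the code above):
# Drop both the year and the dependent variable from the features.
X = ml_data_1218.drop(columns=['Year', 'Burglaries ']).values
y = ml_data_1218['Burglaries '].values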
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and keeping just one in the model) improves model fit (i.e. the R^2 value). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV, and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."
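A quick sketch of that check with pandas, assuming the data is still in the ml_data_1218 DataFrame from the question:
# Pairwise correlations between all numeric variables; values near +/-1
# against the DV (or between two IVs) flag substitutes and redundancy.
corr = ml_data_1218.select_dtypes('number').corr()
print(corr['Burglaries '].sort_values(ascending=False))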

Tensorflow classifier.evaluate running indefinitely?

I've started trying out some of the TensorFlow APIs. I am using the iris data set to experiment with TensorFlow's Estimators. I'm loosely following this tutorial, except that I load my data in a little differently: https://www.tensorflow.org/guide/premade_estimators#top_of_page
My problem is that when the code below executes and I get to the section with:
# Evaluate the model.
eval_result = classifier.evaluate(
My computer just runs seemingly without end. I've been waiting for my Jupyter notebook to complete this step for an hour and a half now, with no end in sight. The last output of the notebook is:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Problem statement: How can I adjust my code to make this evaluation more efficient? I'm obviously making it do much more work than I anticipated.
So far I have tried adjusting the batch size and the number of neurons in the layers, but with no luck.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
#Extract both into features and labels.
#features should be a dictionary.
#label can just be an array
def extract_features_and_labels(dataframe):
    #features and label for training; pop the label from the copy so the
    #features dict doesn't keep 'species' and the caller's frame is untouched
    x = dataframe.copy()
    y = x.pop('species')
    return dict(x), y
#break the data up into train and test.
#split the overall df into training and testing data
train, test = train_test_split(df, test_size=0.2)
train_x, train_y = extract_features_and_labels(train)
test_x, test_y = extract_features_and_labels(test)
print(len(train_y), 'training examples')
print(len(test_y), 'test examples')
my_feature_columns = []
for key in train_x.keys():
    my_feature_columns.append(tf.feature_column.numeric_column(key=key))
def train_input_fn(features, labels, batch_size):
    """An input function for training"""
    # Convert the inputs to a Dataset.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    # Shuffle, repeat, and batch the examples.
    return dataset.shuffle(1000).repeat().batch(batch_size)
#Build the classifier!!!!
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    # Two hidden layers of 4 nodes each.
    hidden_units=[4, 4],
    # The model must choose between 3 classes.
    n_classes=3)
# Train the Model.
classifier.train(
    input_fn=lambda: train_input_fn(train_x, train_y, 10), steps=1000)
# Evaluate the model.
eval_result = classifier.evaluate(
    input_fn=lambda: train_input_fn(test_x, test_y, 100))
print('\nTest set accuracy: {accuracy:0.3f}\n'.format(**eval_result))
Had to switch a lot of things up, but I finally got the estimator working on the iris data set. The root cause of the endless evaluation is that the question's code reuses train_input_fn for evaluation, and its dataset calls .repeat() with no count, so evaluate() never exhausts the data; the code below gives evaluation its own input function with num_epochs=1. Here it is for anyone who may find it useful in the future. Cheers.
#First we want to import what we need. Typically this will be some combination of:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
%matplotlib inline
#Extract the data from the iris dataset.
df = pd.read_csv('IRIS.csv')
#Encode the string labels as integers.
le = LabelEncoder()
df['species'] = le.fit_transform(df['species'])
#features is everything but our label, so simply drop it.
features = df.drop('species', axis=1)
print(features.head())
target = df['species']
print(target.head())
#trains and test
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.33, random_state=42)
#Introduce Tensorflow feature column (numeric column)
numeric_column = ['sepal_length','sepal_width','petal_length','petal_width']
numeric_features = [tf.feature_column.numeric_column(key=column)
                    for column in numeric_column]
print(numeric_features[0])
#Build the input function for training
training_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_train,
                                                        y=y_train,
                                                        batch_size=10,
                                                        shuffle=True,
                                                        num_epochs=None)
#Build the input function for testing input
eval_input_fn = tf.estimator.inputs.pandas_input_fn(x=X_test,
                                                    y=y_test,
                                                    batch_size=10,
                                                    shuffle=False,
                                                    num_epochs=1)
#Instansiate the model
dnn_classifier = tf.estimator.DNNClassifier(feature_columns=numeric_features,
                                            hidden_units=[3, 3],
                                            optimizer=tf.train.AdamOptimizer(1e-4),
                                            n_classes=3,
                                            dropout=0.1,
                                            model_dir="dnn_classifier")
dnn_classifier.train(input_fn=training_input_fn, steps=2000)
#Evaluate the trained model
dnn_classifier.evaluate(input_fn=eval_input_fn)
#Predict on the test set and print each prediction dict
pred = list(dnn_classifier.predict(input_fn=eval_input_fn))
for e in pred:
    print(e)
    print("\n")
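For completeness, the original code could also have been fixed with a much smaller change: give the evaluation its own input function that neither shuffles nor repeats, so the dataset is exhausted after one pass (a minimal sketch against the question's code):
def eval_input_fn(features, labels, batch_size):
    """An input function for evaluation: no shuffle, no repeat,
    so evaluate() terminates after one pass over the data."""
    dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
    return dataset.batch(batch_size)
eval_result = classifier.evaluate(
    input_fn=lambda: eval_input_fn(test_x, test_y, 100))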

Why do my ML models have horrible accuracy?

I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars: it contains features for price, safety, and luxury, and classifies each car as good, very good, acceptable, or unacceptable. I converted all the non-numeric columns into numeric ones, trained the model, and predicted on a test set. However, my predictions are awful; I used LinearRegression, and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all have given me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials and read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data, and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        # numeric columns convert cleanly and are left as-is
        df[col].astype(int)
    except ValueError:
        # non-numeric columns are label encoded
        enc = preprocessing.LabelEncoder()
        df[col] = enc.fit_transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use linear regression here: it is for predicting continuous values, not categorical ones, and what you are trying to predict is categorical. Technically, each class value is a class.
I would suggest trying logistic regression or another classification method such as Naive Bayes, an SVM, or a decision tree classifier instead.
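A minimal sketch of that switch, reusing the variables from the question:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Same split as above; only the model and the metric change:
# accuracy is the natural score for a categorical target, not R^2.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(x_train, y_train)
print(accuracy_score(y_test, classifier.predict(x_test)))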

How to predict a specific text or group of text using nltk Text Analysis libraries after necessary preprocessing

All code is in Python. I have a Python list named "corpus" that contains 2000 reviews in total (both positive and negative). The main part of my code is:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=2000, max_df=0.6, min_df=3, stop_words=stopwords.words("english"))
X = vectorizer.fit_transform(corpus)
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
X = transformer.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
from sklearn.linear_model import LogisticRegression
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)
Now I want to predict whether a sentence is positive or negative ('1' or '0'). The sentence is
sample = ["you are a nice person and have a good life"]
How should I go about predicting the above? (I know the roles of CountVectorizer and TfidfTransformer, but TfidfVectorizer sort of confuses me.)
What you have achieved with CountVectorizer and TfidfTransformer can be achieved with TfidfVectorizer alone.
Answer to your question:
sample = ["you are a nice person and have a good life"]
This is the sample data you want to predict. First, use the transform method of the vectorizer (CountVectorizer):
Count_sample = vectorizer.transform(sample)
After transforming with CountVectorizer, use the transform method of the transformer (TfidfTransformer):
Tfidf_sample = transformer.transform(Count_sample)
After completing all the transformations of the data, use the predict function of LogisticRegression:
predicted = logistic_reg.predict(Tfidf_sample)
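For the first point above, a sketch of the same pipeline built with TfidfVectorizer alone (it combines CountVectorizer and TfidfTransformer in one step):
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
# One vectorizer replaces CountVectorizer + TfidfTransformer.
vectorizer = TfidfVectorizer(max_features=2000, max_df=0.6, min_df=3,
                             stop_words=stopwords.words("english"))
X = vectorizer.fit_transform(corpus)
# ... train/test split and LogisticRegression fit as before ...
Tfidf_sample = vectorizer.transform(sample)
predicted = logistic_reg.predict(Tfidf_sample)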
