How to predict a specific text or group of text using nltk Text Analysis libraries after necessary preprocessing - python-3.x

All code is in python. I have a python list named "corpus" that contains reviews in total 2000(+ve and -ve reviews both). The main/important part of mycode is:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=2000, max_df=0.6, min_df=3, stop_words=stopwords.words("english"))
X = vectorizer.fit_transform(corpus)
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
X = transformer.fit_transform(X)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
from sklearn.linear_model import LogisticRegression
logistic_reg = LogisticRegression()
logistic_reg.fit(X_train, y_train)
Now I want to predict a sentence as +ve or -ve('1' or '0'). The sentence is
sample = ["you are a nice person and have a good life"]
How should I go about predicting for the above.(I know what is the role of CountVectorizer and TdfidfTransformer but it is sort of confusing me with the TdfidfVectorizer)

The things you have achieved by CountVectorizer and TfidfTranformer can be achieved by TfidfVecorizer alone.
Answer to your question:
sample = ["you are a nice person and have a good life"]
This is your sample data you want to predict.Here i have used transform method on vectorizer (CountVectorizer)
Count_sample = vectorizer.transform(sample)
After transforming CountVectorizer we have to use transform method on transformer(TfidfTranformer)
Tfidf_sample = transformer.transform(Count_sample)
After completing all the transformation of data use predict function of LogisticRegression
predicted = logistic_reg.predict(Tfidf_sample)

Related

How to get predictions for new data from MultinomialNB?

I'm venturing into a new topic and experimenting with categorising product names. Without deeper knowledge, the use of MultinomialNB (superficially) already yielded quite good results for my use case.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
df = pd.DataFrame({
'title':['short shirt', 'long shirt','green shoe','cool sneaker','heavy ballerinas'],
'label':['shirt','shirt','shoe','shoe','shoe']
})
count_vec = CountVectorizer()
bow = count_vec.fit_transform(df['title'])
bow = np.array(bow.todense())
X = bow
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
model = MultinomialNB().fit(X_train, y_train)
model.predict(X_test)
Based on the trainigs of the above simplified example, I would like to categorise completely new titles and output them with the predicted labels:
new = pd.DataFrame({
'title':['long top', 'super shirt','white shoe','super cool sneaker','perfect fit ballerinas'],
'label': np.nan
})
Unfortunately, I am not sure of the next steps and would hope for some support.
...
count_vec = CountVectorizer()
bow = count_vec.fit_transform(new['title'])
bow = np.array(bow.todense())
model.predict(bow)
It's a mistake to fit CountVectorizer on the whole dataset, because the test set should not be used at all during training. This discipline not only follows proper ML principles (to prevent data leakage), it also avoids this practical problem: when the test set is prepared together with the training set, it gets confusing to apply the model to another test set.
The clean way to proceed is to always split the data first between training and test set, this way one is forced to correctly transform the test set independently from the training set. Then it's easy to apply the model on another test set.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
df = pd.DataFrame({
'title':['short shirt', 'long shirt','green shoe','cool sneaker','heavy ballerinas'],
'label':['shirt','shirt','shoe','shoe','shoe']
})
X = df['title']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y)
# 1) Training: use only training set!
# labels should be encoded
le = preprocessing.LabelEncoder()
y_train_enc = le.fit_transform(y_train)
count_vec = CountVectorizer()
X_train_bow = count_vec.fit_transform(X_train)
X_train_bow = np.array(X_train_bow.todense())
model = MultinomialNB().fit(X_train_bow, y_train_enc)
# 2) Testing: apply previous transformation to test set before applying model
X_test_bow = count_vec.transform(X_test)
X_test_bow = np.array(X_test_bow.todense())
y_test_enc = model.predict(X_test_bow)
print("Predicted labels test set 1:",le.inverse_transform(y_test_enc))
# 3) apply to another dataset = just another step of testing, same as above
new = pd.DataFrame({
'title':['long top', 'super shirt','white shoe','super cool sneaker','perfect fit ballerinas'],
'label': np.nan
})
X_new = new['title']
X_new_bow = count_vec.transform(X_new)
X_new_bow = np.array(X_new_bow.todense())
y_new_enc = model.predict(X_new_bow)
print("Predicted labels test set 2:", le.inverse_transform(y_new_enc))
Notes:
This point is not specific to MultinomialNB, this is the correct method for any classifier.
With real data it's often a good idea to use the min_df argument with CountVectorizer, because rare words increase the number of features, don't help predicting the label and cause overfitting.

Different R-squared scores for different times

I just learned about cross-validation and when I give in different arguments, there are different results.
This is the code for building the Regression Model and the R-squared output was about .5 :
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
boston = load_boston()
X = boston.data
y = boston['target']
X_rooms = X[:,5]
X_train, X_test, y_train, y_test = train_test_split(X_rooms, y)
reg = LinearRegression()
reg.fit(X_train.reshape(-1,1), y_train)
prediction_space = np.linspace(min(X_rooms), max(X_rooms)).reshape(-1,1)
plt.scatter(X_test, y_test)
plt.plot(prediction_space, reg.predict(prediction_space), color = 'black')
reg.score(X_test.reshape(-1,1), y_test)
Now when I give the cross-validation for X_train, X_test, and X(respectively), it shows different R-squared values.
Here's the X_test and y_test arguments:
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_test.reshape(-1,1), y_test, cv = 8)
cv
The result:
array([ 0.42082715, 0.6507651 , -3.35208835, 0.6959869 , 0.7770039 ,
0.59771158, 0.53494622, -0.03308137])
Now when I use the arguments, X_train and y_train, different results are outputted.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
cv
The result:
array([0.46500321, 0.27860944, 0.02537985, 0.72248968, 0.3166983 ,
0.51262191, 0.53049663, 0.60138472])
Now, when I input different arguments again; this time X(which in my case is X_rooms) and y, I yet again get different R-squared values.
from sklearn.model_selection import cross_val_score
cv = cross_val_score(reg, X_rooms.reshape(-1,1), y, cv = 8)
cv
The output:
array([ 0.61748403, 0.79811218, 0.61559222, 0.6475456 , 0.61468198,
-0.7458466 , -3.71140488, -1.17174927])
Which one should I use?
I know this is a long question so Thanks!!
Train set should be distinctly use for training your model, while test set is for final evaluation. But unfortunately, you need to test your model's score on some set before checking it on final result (test set): for example when you try to tune some hyper-parameters. There are some other reasons to use cv, it's just one of them.
Usually the process is:
Split train and test
Train model use cv to check stability, including hyper-tune params (which is irrelevant in your case)
Assess model score on test set.
scikit-learn's cross_val_score receives an object (before training!) and data. It trains each time model on different section of data, and then returns the score. It's like having a lot of "train-test" checks.
Therefore, you should:
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
cv = cross_val_score(reg, X_train.reshape(-1,1), y_train, cv = 8)
solely on train set. Test set should be used for other purposes.
What you get is a list of accuracy score. You can see if your model is stable (does accuracy is in same range among all folds?) or general performance of model (avg score)

Why is my sklearn linear regression model producing perfect predictions?

I'm trying to do multiple linear regression with sklearn and I have performed the following steps. However, when it comes to predicting y_pred using the trained model I am getting a perfect r^2 = 1.0. Does anyone know why this is the case/what's going wrong with my code?
Also sorry I'm new to this site so I'm not fully up to speed with the formatting/etiquette of questions!
import numpy as np
import pandas as pd
# Import and subset data
ml_data_all = pd.read_excel('C:/Users/User/Documents/RSEM/STADM/Coursework/Crime_SF/Machine_learning_collated_data.xlsx')
ml_data_1218 = ml_data_all[ml_data_all['Year'] >= 2012]
ml_data_1218.drop(columns=['Pop_MOE',
'Pop_density_MOE',
'Age_median_MOE',
'Sex_ratio_MOE',
'Income_median_household_MOE',
'Pop_total_pov_status_determ_MOE',
'Pop_total_50percent_pov_MOE',
'Pop_total_125percent_pov_MOE',
'Poverty_percent_below_MOE',
'Total_labourforceMOE',
'Unemployed_total_MOE',
'Unemployed_total_male_MOE'], inplace=True)
# Taking care of missing data
# Delete rows containing any NaNs
ml_data_1218.dropna(axis=0,
how='any',
inplace=True)
# DATA PREPROCESSING
# Defining X and y
X = ml_data_1218.drop(columns=['Year']).values
y = ml_data_1218['Burglaries '].values
# Encoding categorical data
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
transformer = ColumnTransformer(transformers=[("cat", OneHotEncoder(), [0])], remainder='passthrough')
X = transformer.fit_transform(X)
X.toarray()
X = pd.DataFrame.sparse.from_spmatrix(X)
# Split into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,149:] = sc_X.fit_transform(X_train.iloc[:,149:])
X_test.iloc[:,149:] = sc_X.transform(X_test.iloc[:,149:])
# Fitting multiple linear regression to training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting test set results
y_pred = regressor.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
So turns out it was a stupid mistake in the end: I forgot to drop the dependent variable (Burglaries) from the X columns haha, hence why the linear regression model was making perfect predictions. Now it's working (r2 = 0.56). Thanks everyone!
With regression, it's often a good idea to run a correlation matrix against all of your variables (IVs and the DV). Regression likes parsimony, so removing IVs that are functionally the same (and just leaving one in the model) is better for R^2 value (aka model fit). Also, if something is correlated at .97 or higher with the DV, it is basically a substitute for the DV and all the other data is most likely superfluous.
When reading your issue (before I saw your "Answer") I was thinking "either this person has outrageous correlation issues or the DV is also in the prediction data."

Why do my ML models have horrible accuracy?

I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars, it contains features of price, safety and luxury and classifies if its good, very good, acceptable and unacceptable. I converted all the non-numeric columns into numeric, trained the data and predicted with a test set. However, my predictions are awful; I used LinearRegression and r2_score outputs 0.05 which is practically 0. I have tried a few different models and all have been giving me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials, read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
try:
if df[col].astype(int):
pass
except ValueError:
enc = preprocessing.LabelEncoder()
enc.fit(df[col])
df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use Linear Regression, which is used for predicting continuous values and not categorical values. In your case what you are trying to predict is categorical. Technically, each situation is a class.
I would suggest trying Logistic Regression or other type of classification methods such as Naive Bayes, SVM , decision tree classifiers etc. instead.

trying a custom computation of grid.best_score_ (obtained with GridSearchCV)

I'm trying to recompute grid.best_score_ I obtained on my own data without success...
So I tried it using a conventional dataset but no more success. Here is the code :
from sklearn import datasets
from sklearn import linear_model
from sklearn.cross_validation import ShuffleSplit
from sklearn import grid_search
from sklearn.metrics import r2_score
import numpy as np
lr = linear_model.LinearRegression()
boston = datasets.load_boston()
target = boston.target
param_grid = {'fit_intercept':[False]}
cv = ShuffleSplit(target.size, n_iter=5, test_size=0.30, random_state=0)
grid = grid_search.GridSearchCV(lr, param_grid, cv=cv)
grid.fit(boston.data, target)
# got cv score computed by gridSearchCV :
print grid.best_score_
0.677708680059
# now try a custom computation of cv score
cv_scores = []
for (train, test) in cv:
y_true = target[test]
y_pred = grid.best_estimator_.predict(boston.data[test,:])
cv_scores.append(r2_score(y_true, y_pred))
print np.mean(cv_scores)
0.703865991851
I can't see why it's different, GridSearchCV is supposed to use scorer from LinearRegression, which is r2 score. Maybe the way I code cv score is not the one used to compute best_score_... I'm asking here before going through GridSearchCV code.
Unless refit=False in the GridSearchCV constructor, the winning estimator is refit on the entire dataset at the end of fit. best_score_ is the estimator's average score using the cross-validation splits, while best_estimator_ is an estimator of the winning configuration fit on all the data.
lr2 = linear_model.LinearRegression(fit_intercept=False)
scores2 = [lr2.fit(boston.data[train,:], target[train]).score(boston.data[test,:], target[test])
for train, test in cv]
print np.mean(scores2)
Will print 0.67770868005943297.

Resources