Why train_test_split influences the results of regression so drastically? - python-3.x

I am performing some tests on the house prices prediction competition of Kaggle.
For easiness, find here below the complete process to download, pre-process and start predicting using a simple linear regression model :
download the data
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
saveDir = "data"
if not os.path.exists("data"):
os.makedirs(saveDir)
api.competition_download_files("house-prices-advanced-regression-techniques","data")
print("the following files have been downloaded \n" + '\n'.join('{}'.format(item) for item in os.listdir("data")))
print("they are located in " + saveDir)
get train and test data
train = pd.read_csv(saveDir + r"\train.csv")
test = pd.read_csv(saveDir + r"\test.csv")
xTrain = train.iloc[:,1:-1] # remove id & SalePrice
yTrain = train.iloc[:,-1] # SalePrice
xTest = test.iloc[:,1:] # remove id
split numerical and test data
catData = xTrain.columns[xTrain.dtypes == object]
numData = list(set(xTrain.columns) - set(catData))
print("The number of columns in the original dataframe is " + str(len(xTrain.columns)))
print("The number of columns in the categorical and numerical data dds up to " + str(len(catData)+len(numData)))
Define a cleaning function to deal with NaN / None
def cleanData(data, catData, numData) :
dataClean = data.copy()
# Let's deal with NaN ...
# check where there are NaN in categorical
dataClean[catData].columns[dataClean[catData].isna().any(axis=0)]
# take care that some categorical could be numerics so
# differentiate the two cases
dataTypes = [dataClean.loc[dataClean.loc[:,col].notnull(),col].apply(type).iloc[0] for col in catData] # get the data type for each column
# have to be carefull to not take a data that is NaN or None
# when evaluating its type
from itertools import compress
catDataNum = [True if ((col == float) | (col == int)) else False for col in dataTypes ] # if data type is numeric (float/int), register it
catDataNum = list(compress(catData, catDataNum))
catDataNotNum = list(set(catData)-set(catDataNum))
print("The number of columns in the dataframe is " + str(len(dataClean.columns)))
print("The number of columns in the categorical and numerical data dds up to " +
str(len(catDataNum) + len(catDataNotNum)+len(numData)))
# Check what NA means for each feature ...
# BsmtQual : NA means no basement
# GarageType : NA means no garage
# BsmtExposure : NA means no basement
# Alley : NA means no alley access
# BsmtFinType2 : NA means no basement
# GarageFinish : NA means no garage
# did not check the rest ... I will just replace with a category "No"
# For categorical, NaN values mean the considered feature
# do not exist (this requires dataset analysis as performed above)
dataClean[catDataNotNum] = dataClean[catDataNotNum].fillna(value = 'No')
mean = dataClean[catDataNum].mean()
dataClean[catDataNum] = dataClean[catDataNum].fillna(value = mean)
# for numerical, replace with mean
mean = dataClean[numData].mean()
dataClean[numData] = dataClean[numData].fillna(value = mean)
return dataClean
Perform the cleaning
xTrainClean = cleanData(xTrain, catData, numData)
# check if no NaN or None anymore
if xTrainClean.isna().sum().sum() != 0:
print(xTrainClean.iloc[:,xTrainClean.isna().any(axis=0).values])
else :
print("All good! No more NaN or None in training data!")
# same with test data
# perform the cleaning
xTestClean = cleanData(xTest, catData, numData)
# check if no NaN or None anymore
if xTestClean.isna().sum().sum() != 0:
print(xTestClean.iloc[:,xTestClean.isna().any(axis=0).values])
else :
print("All good! No more NaN or None in test data!")
Pre-process data i.e. one-hot encode the categorical features
import sklearn as sk
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# We would like to perform a linear regression on all data
# but some data are categorical ...
# so first, perform a one-hot encoding on categorical variables
ct = ColumnTransformer(transformers = [("OneHotEncoder", OneHotEncoder(categories='auto', drop=None,
sparse=False, n_values='auto',
handle_unknown = "error"),
catData)],
remainder = "passthrough")
ct = ct.fit(pd.concat([xTrainClean, xTestClean])) # fit on both xTrain & xTest to be sure to have all possible categorical values
# test it separately (.fit(xTrain) / .fit(xTest) and analyze to understand)
# resulting categories and values can be obtained through
# ct.named_transformers_ ['OneHotEncoder'].categories_
xTrainOneHot = ct.transform(xTrainClean)
Split training data into an "internal" training and test sets
xTestOneHotKaggle = xTestOneHot.copy()
from sklearn.model_selection import train_test_split
xTrainInternalOneHot, xTestInternalOneHot, yTrainInternal, yTestInternal = train_test_split(xTrainOneHot, yTrain, test_size=0.5, random_state=42, shuffle = False)
print("The training data now contains " + str(xTrainInternalOneHot.shape[0]) + " samples")
print("The training data now contains " + str(yTrainInternal.shape[0]) + " labels")
print("The test data now contains " + str(xTestInternalOneHot.shape[0]) + " samples")
print("The test data now contains " + str(yTestInternal.shape[0]) + " labels")
Train ... And this is where the funny part is
reg = LinearRegression().fit(xTrainInternalOneHot,yTrainInternal)
yTrainInternalPredict = reg.predict(xTrainInternalOneHot)
yTestInternalPredict = reg.predict(xTestInternalOneHot)
print("The R2 score on training data is equal to " + str(reg.score(xTrainInternalOneHot,yTrainInternal)))
print("The R2 score on the internal test data is equal to " + str(reg.score(xTestInternalOneHot, yTestInternal)))
from sklearn.metrics import mean_squared_log_error
print("Tke Kaggle metric score (RMSLE) on internal training data is equal to " +
str(np.sqrt(mean_squared_log_error(yTrainInternal, yTrainInternalPredict))))
print("Tke Kaggle metric score (RMSLE) on internal test data is equal to " +
str(np.sqrt(mean_squared_log_error(yTestInternal, yTestInternalPredict))))
Question
So with the above process, one will get an error when computing the Kaggle metric i.e. RMLSE because some values are negative. The funny thing is that if I change the test_size parameter from 0.5 to 0.2 then no more negative values. One could understand as more data gets used to train so the model performs better. But if I move it from 0.2 to 0.3 (less dramatic change i.e. ~100 training samples) then the issue of the model predicting negative values appear again.
Two questions :
Is this expected i.e. that the model is so sensitive to
training data ? This is even clearer because if test_size = 0.2
is used with shuffle = False then it works. If used when shuffle =
True, then the model starts predicting negative values.
How to deal with such behavior ? Obviously, this is a very simple
model (no standardization, no scaling, no regularization ...) but I
believe it is interesting to really understand what is going on in
this very simple model.

Is this expected i.e. that the model is so sensitive to training data ? This is even clearer because if test_size = 0.2 is used with shuffle = False then it works. If used when shuffle = True, then the model starts predicting negative values.
For your question, yes this split can matter!
How to deal with such behavior ? Obviously, this is a very simple model (no standardization, no scaling, no regularization ...) but I believe it is interesting to really understand what is going on in this very simple model.
Did you ever hear about cross-validation?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
The concept is to train your classifier/regression with several datasplits, which always have a different train/test-split to avoid this behavior you are explaning, then you can really judge your prediction quality, as new data could also have several different structures.
So you run severaal Iterations and then judge about the outcome.

Related

Neural Network Python Gradient Descent

I am new to machine learning and trying to understand it (self-learning). So I grabbed a book (this one if interested: https://www.amazon.com/Neural-Networks-Unity-Programming-Windows/dp/1484236726) and started to read the first chapter. While reading, there are a few things I did not understand so I went to research online.
However, I still have trouble with a few points that I cannot understand even after so much reading and research:
How are we calculating l2_delta and l1_delta? (marked with #what is this part doing? in code below)
How does gradient descent relate? (I looked up the formula and tried to read a bit about it but I could not relate the one line code to the code I have down there)
Is that a network with 3 layers (layer 1: 3 input nodes, layer 2: not sure ,layer 3: 1 output node )
Neural Network Full Code:
trying to write my first neural network!
import numpy as np
#activation function (sigmoid , maps value between 0 and 1)
def sigmoid(x):
return 1/(1+np.exp(-x))
def derivative(x):
return x*(1-x)
#initialize input (4 training data (row), 3 features (col))
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
#initialize output for training data (4 training data (rows), 1 output for each (col))
Y = np.array([[0],[1],[1],[0]])
np.random.seed(1)
#synapses
syn0 = 2* np.random.random((3,4)) - 1
syn1 = 2* np.random.random((4,1)) - 1
for iter in range(60000):
#layers
l0 = X
l1 = sigmoid(np.dot(l0,syn0))
l2 = sigmoid(np.dot(l1,syn1))
#error
l2_error = Y - l2
if(iter % 10000 == 0): #only print error every 10000 steps to save time and limit the amount of output
print("Error L2: " + str (np.mean(np.abs(l2_error))))
#what is this part doing?
l2_delta = l2_error * derivative(l2)
l1_error = l2_delta.dot(syn1.T)
l1_delta = l1_error * derivative(l1)
if(iter % 10000 == 0): #only print error every 10000 steps to save time and limit the amount of output
print("Error L1: " + str (np.mean(np.abs(l1_error))))
#update weights
syn1 = syn1 + l1.T.dot(l2_delta) // derative with respect to cost function
syn0 = syn2 + l0.T.dot(l1_delta)
print(l2)
Thank you!
In general, layerwise computations (Hence the notation l1 and l2 above) is simply getting the dot product of a vector $x \in \mathbb{R}^n$ and a vector of weights in the same dimension, then applying the sigmoid function on each component .
Gradient Descent. - - - Imagine, in two dimensions say the graph of $f(x) = x^2$ Suppose, we don't know how to get it's minimum. Gradient descent will basically evaluate $f'(x)$ at various points, and check whether $f'(x)$ is close to zero

Find wrongly categorized samples from validation step

I am using a keras neural net for identifying category in which the data belongs.
self.model.compile(loss='categorical_crossentropy',
optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
metrics=[categorical_accuracy])
Fit function
history = self.model.fit(self.X,
{'output': self.Y},
validation_split=0.3,
epochs=400,
batch_size=32
)
I am interested in finding out which labels are getting categorized wrongly in the validation step. Seems like a good way to understand what is happening under the hood.
You can use model.predict_classes(validation_data) to get the predicted classes for your validation data, and compare these predictions with the actual labels to find out where the model was wrong. Something like this:
predictions = model.predict_classes(validation_data)
wrong = np.where(predictions != Y_validation)
If you are interested in looking 'under the hood', I'd suggest to use
model.predict(validation_data_x)
to see the scores for each class, for each observation of the validation set.
This should shed some light on which categories the model is not so good at classifying. The way to predict the final class is
scores = model.predict(validation_data_x)
preds = np.argmax(scores, axis=1)
be sure to use the proper axis for np.argmax (I'm assuming your observation axis is 1). Use preds to then compare with the real class.
Also, as another exploration you want to see the overall accuracy on this dataset, use
model.evaluate(x=validation_data_x, y=validation_data_y)
I ended up creating a metric which prints the "worst performing category id + score" on each iteration. Ideas from link
import tensorflow as tf
import numpy as np
class MaxIoU(object):
def __init__(self, num_classes):
super().__init__()
self.num_classes = num_classes
def max_iou(self, y_true, y_pred):
# Wraps np_max_iou method and uses it as a TensorFlow op.
# Takes numpy arrays as its arguments and returns numpy arrays as
# its outputs.
return tf.py_func(self.np_max_iou, [y_true, y_pred], tf.float32)
def np_max_iou(self, y_true, y_pred):
# Compute the confusion matrix to get the number of true positives,
# false positives, and false negatives
# Convert predictions and target from categorical to integer format
target = np.argmax(y_true, axis=-1).ravel()
predicted = np.argmax(y_pred, axis=-1).ravel()
# Trick from torchnet for bincounting 2 arrays together
# https://github.com/pytorch/tnt/blob/master/torchnet/meter/confusionmeter.py
x = predicted + self.num_classes * target
bincount_2d = np.bincount(x.astype(np.int32), minlength=self.num_classes**2)
assert bincount_2d.size == self.num_classes**2
conf = bincount_2d.reshape((self.num_classes, self.num_classes))
# Compute the IoU and mean IoU from the confusion matrix
true_positive = np.diag(conf)
false_positive = np.sum(conf, 0) - true_positive
false_negative = np.sum(conf, 1) - true_positive
# Just in case we get a division by 0, ignore/hide the error and set the value to 0
with np.errstate(divide='ignore', invalid='ignore'):
iou = false_positive / (true_positive + false_positive + false_negative)
iou[np.isnan(iou)] = 0
return np.max(iou).astype(np.float32) + np.argmax(iou).astype(np.float32)
~
usage:
custom_metric = MaxIoU(len(catagories))
self.model.compile(loss='categorical_crossentropy',
optimizer=keras.optimizers.Adam(lr=0.001, decay=0.0001),
metrics=[categorical_accuracy, custom_metric.max_iou])

Multivariate binary sequence prediction with CRF

this question is an extension of this one which focuses on LSTM as opposed to CRF. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
Problem:
I would like to predict a sequence of binary signal for multiple, non-independent groups. My dataset is moderately small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
Timestamps
Group
Binary signal representing activity
Using this dataset I would like to forecast group_a_activity and group_b_activity which are both 0 or 1.
Note that the groups are believed to be cross-correlated and additional features can be extracted from timestamps -- for simplicity we can assume that there is only 1 feature we extract from the timestamps.
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame # create a sample dataframe
.from_records(np.random.randint(2, size=[data_length, 3]))
.rename(columns={0:'a', 1:'b', 2:'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
for s in shift_range:
if not (c == 'extra' and s > 0):
charge = 'next' if s > 0 else 'last' # 'next' variables is what we want to predict
formatted_s = '{0:02d}'.format(abs(s))
new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
[5 rows x 15 columns]
Before we get to the CRF part, I suspect that I cannot use approach this problem from a multi-task learning point of view (predicting patterns for both A and B via one model) and therefore I'm going to have to predict each of them individually.
Now the CRF part. I've found some relevant example (here is one) but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# lets focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if not 'next' in c]
current_predictors = set([re.sub('_\d+$','',v) for v in current_predictors])
for index, row in df.iterrows():
# not sure if its an effective way to iterate over a DF...
iter_features = []
for p in current_predictors:
pred_feature = []
# note that 0/1 values have to be converted into booleans
for k in range(shift_length):
iter_pred_feature = p + '_{0:02d}'.format(k+1)
pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
iter_features.append(pred_feature)
iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
crf_labels.extend(iter_response)
crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
trainer.append(xseq, yseq)
trainer.set_params({
'c1': 0.0, # coefficient for L1 penalty
'c2': 0.0, # coefficient for L2 penalty
'max_iterations': 10, # stop earlier
# include transitions that are possible, but not observed
'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
It seems that I did manage to get it working, but I'm not sure if I've approached it correctly. I'll formulate my questions in the Questions section, but first, here is an alternative approach using keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a','b']:
response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub('_\d+$','',c) for c in df.columns if 'next' not in c])
for c in col_prefix:
current_array = df[[z for z in df.columns if z.startswith(c)]].values
response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
Questions:
My main question is whether or not I've constructed both of my CRF models correctly. What worries me is that (1) there is not a lot of documentation out there on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested and (4) when used in a multi-tasked fashion, I'm not sure if it is valid.
I have a few extra questions as well:
Is a CRF appropriate for this problem?
How are the 2 approaches (one based on pycrfuite and one based on keras_contrib) different and what are their advantages/disadvantages?
In a more general sense, what is the advantage of combining CRF and LSTM models into one (like one discussed here)
Many thanks!

Tensorflow Extracting Classification Predictions

I've a tensorflow NN model for classification of one-hot-encoded group labels (groups are exclusive), which ends with (layerActivs[-1] are the activations of the final layer):
probs = sess.run(tf.nn.softmax(layerActivs[-1]),...)
classes = sess.run(tf.round(probs))
preds = sess.run(tf.argmax(classes))
The tf.round is included to force any low probabilities to 0. If all probabilities are below 50% for an observation, this means that no class will be predicted. I.e., if there are 4 classes, we could have probs[0,:] = [0.2,0,0,0.4], so classes[0,:] = [0,0,0,0]; preds[0] = 0 follows.
Obviously this is ambiguous, as it is the same result that would occur if we had probs[1,:]=[.9,0,.1,0] -> classes[1,:] = [1,0,0,0] -> 1 preds[1] = 0. This is a problem when using the tensorflow builtin metrics class, as the functions can't distinguish between no prediction, and prediction in class 0. This is demonstrated by this code:
import numpy as np
import tensorflow as tf
import pandas as pd
''' prepare '''
classes = 6
n = 100
# simulate data
np.random.seed(42)
simY = np.random.randint(0,classes,n) # pretend actual data
simYhat = np.random.randint(0,classes,n) # pretend pred data
truth = np.sum(simY == simYhat)/n
tabulate = pd.Series(simY).value_counts()
# create placeholders
lab = tf.placeholder(shape=simY.shape, dtype=tf.int32)
prd = tf.placeholder(shape=simY.shape, dtype=tf.int32)
AM_lab = tf.placeholder(shape=simY.shape,dtype=tf.int32)
AM_prd = tf.placeholder(shape=simY.shape,dtype=tf.int32)
# create one-hot encoding objects
simYOH = tf.one_hot(lab,classes)
# create accuracy objects
acc = tf.metrics.accuracy(lab,prd) # real accuracy with tf.metrics
accOHAM = tf.metrics.accuracy(AM_lab,AM_prd) # OHE argmaxed to labels - expected to be correct
# now setup to pretend we ran a model & generated OHE predictions all unclassed
z = np.zeros(shape=(n,classes),dtype=float)
testPred = tf.constant(z)
''' run it all '''
# setup
sess = tf.Session()
sess.run([tf.global_variables_initializer(),tf.local_variables_initializer()])
# real accuracy with tf.metrics
ACC = sess.run(acc,feed_dict = {lab:simY,prd:simYhat})
# OHE argmaxed to labels - expected to be correct, but is it?
l,p = sess.run([simYOH,testPred],feed_dict={lab:simY})
p = np.argmax(p,axis=-1)
ACCOHAM = sess.run(accOHAM,feed_dict={AM_lab:simY,AM_prd:p})
sess.close()
''' print stuff '''
print('Accuracy')
print('-known truth: %0.4f'%truth)
print('-on unprocessed data: %0.4f'%ACC[1])
print('-on faked unclassed labels data (s.b. 0%%): %0.4f'%ACCOHAM[1])
print('----------\nTrue Class Freqs:\n%r'%(tabulate.sort_index()/n))
which has the output:
Accuracy
-known truth: 0.1500
-on unprocessed data: 0.1500
-on faked unclassed labels data (s.b. 0%): 0.1100
----------
True Class Freqs:
0 0.11
1 0.19
2 0.11
3 0.25
4 0.17
5 0.17
dtype: float64
Note freq for class 0 is same as faked accuracy...
I experimented with setting a value of preds to np.nan for observations with no predictions, but tf.metrics.accuracy throws ValueError: cannot convert float NaN to integer; also tried np.inf but got OverflowError: cannot convert float infinity to integer.
How can I convert the rounded probabilities to class predictions, but appropriately handle unpredicted observations?
This has gone long enough without an answer, so I'll post here as the answer my solution. I convert belonging probabilities to class predictions with a new function that has 3 main steps:
set any NaN probabilities to 0
set any probabilities below 1/num_classes to 0
use np.argmax() to extract predicted classes, then set any unclassed observations to a uniformly selected class
The resultant vector of integer class labels can be passed to the tf.metrics functions. My function below:
def predFromProb(classProbs):
'''
Take in as input an (m x p) matrix of m observations' class probabilities in
p classes and return an m-length vector of integer class labels (0...p-1).
Probabilities at or below 1/p are set to 0, as are NaNs; any unclassed
observations are randomly assigned to a class.
'''
numClasses = classProbs.shape[1]
# zero out class probs that are at or below chance, or NaN
probs = classProbs.copy()
probs[np.isnan(probs)] = 0
probs = probs*(probs > 1/numClasses)
# find any un-classed observations
unpred = ~np.any(probs,axis=1)
# get the predicted classes
preds = np.argmax(probs,axis=1)
# randomly classify un-classed observations
rnds = np.random.randint(0,numClasses,np.sum(unpred))
preds[unpred] = rnds
return preds

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Resources