How do I make my algo work with KNN text classification? - python-3.x

Trying to make my classification accepting a text (string) and not just a number (numeric). Working with data, carrying a load of pulled articles, I want the classification algo to show which ones to proceed with and which ones to drop. Applying a number, things are working just fine, yet this is not very intuitive, although I know that the number represents a relationship to one of the two classes I am working with.
How do I change the logic in the algo to make it accept a text as search criteria and not just an anonymous number, picked from the 'Unique_id' column? Columns are, btw...'Title', 'Abstract', 'Relevant', 'Label', 'Unique_id'. The reason for concatenating df's at algo end is that I want to compare results. Finally. it should be noted that the col 'Label' consists of a list of keywords, so basically I want the algo to read from that col.
I did try, reading from data sources, changing the 'index_col='Unique_id' to 'index_col='Label', but that did not work out either.
An example of what I want:
print("\nPrint KNN1")
print(get_closest_neighs1('search word'), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2('search word'), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3('search word'), "\n")
This is the full code (view end of algo to see above example as it runs today, using a number to identify nearest neighbor):
import pandas as pd
print("\nPerforming Analysis using Text Classification")
data = pd.read_csv('File_1_coltest_demo.csv', sep=';', encoding="ISO-8859-1").dropna()
data['Unique_id'] = data.groupby(['Title', 'Abstract', 'Relevant']).ngroup()
data.to_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index=False)
data1 = pd.read_csv('File_2_coltest_demo_KNN.csv', sep=';', encoding="ISO-8859-1", index_col='Unique_id')
data2 = pd.DataFrame(data1, columns=['Abstract', 'Relevant'])
data2.to_csv('File_3_coltest_demo_KNN_reduced.csv', sep=';', encoding="ISO-8859-1", index=False)
print("\nData top 25 items")
print(data2.head(25))
print("\nData info")
print(data2.info())
print("\nData columns")
print(data2.columns)
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)
text_counts = cv.fit_transform(data2['Abstract'])
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
text_counts, data2['Abstract'], test_size=0.5, random_state=1)
print("\nTF IDF")
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(data2['Abstract'])
print(text_tf)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
text_tf, data2['Abstract'], test_size=0.3, random_state=123)
from sklearn.neighbors import NearestNeighbors
import pandas as pd
nbrs = NearestNeighbors(n_neighbors=20, metric='euclidean').fit(text_tf)
def get_closest_neighs1(Abstract):
row = data2.index.get_loc(Abstract)
distances, indices = nbrs.kneighbors(text_tf.getrow(row))
names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Abstract'])
result = pd.DataFrame({'distance1' : distances.flatten(), 'Abstract' : names_similar}) # 'Unique_id' : names_similar,
return result
def get_closest_neighs2(Unique_id):
row = data2.index.get_loc(Unique_id)
distances, indices = nbrs.kneighbors(text_tf.getrow(row))
names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Unique_id'])
result1 = pd.DataFrame({'Distance' : distances.flatten() / 10, 'Unique_id' : names_similar}) # 'Unique_id' : names_similar,
return result1
def get_closest_neighs3(Relevant):
row = data2.index.get_loc(Relevant)
distances, indices = nbrs.kneighbors(text_tf.getrow(row))
names_similar = pd.Series(indices.flatten()).map(data2.reset_index()['Relevant'])
result2 = pd.DataFrame({'distance2' : distances.flatten(), 'Relevant' : names_similar}) # 'Unique_id' : names_similar,
return result2
print("\nPrint KNN1")
print(get_closest_neighs1(114), "\n")
print("\nPrint KNN2")
print(get_closest_neighs2(114), "\n")
print("\nPrint KNN3")
print(get_closest_neighs3(114), "\n")
data3 = pd.DataFrame(get_closest_neighs1(114))
data4 = pd.DataFrame(get_closest_neighs2(114))
data5 = pd.DataFrame(get_closest_neighs3(114))
del data5['distance2']
data6 = pd.concat([data3, data4, data5], axis=1).reindex(data3.index)
del data6['distance1']
data6.to_csv('File_4_coltest_demo_KNN_results.csv', sep=';', encoding="ISO-8859-1", index=False)

If I understand you right you are trying to do this:
You have vectorised all your documents by their "Abstract" field. Therefore documents with abstracts with similar word distributions should be nearby in TFIDF space.
You want to find the nearest neighbours to a document which has the search keyword.
Therefore you'd need to search the original corpus for the first or all documents which have that keyword
then find the index of that/those document(s), and then find their neighbours.
if there are multiple documents with that keyword, you would need to sort the index list and merge the overall results somehow with some weightings.
If this is true, then the keyword search/lookup isn't really "inside" the model, it's just preselecting a document from the corpus. Once you have the document index, you can perform the KNN (repeatedly).
I'm not hugely familiar with Pandas, but I've done this kind of thing "manually" before e.g. by keeping the document titles in a separate array, with a map to the indexes.
I would imagine you would need to replace your data2.index.get_loc() calls with an iteration over the column values for "Label" and do a simple string search on each. Or does Pandas provide search functions within the corpus?
e.g. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query

Related

BERT binary Textclassification get different results every run

I do binary text classification with BERT from the Simpletransformer.
I work in Colab with GPU runtime type.
I have generated train and test set with the sklearn StratifiedKFold Method. I have two files with the dictionaries containing my folds.
I run my classification in the following while loop:
from sklearn.metrics import matthews_corrcoef, f1_score
import sklearn
counter = 0
resultatos = []
while counter != len(trainfolds):
model = ClassificationModel('bert', 'bert-base-multilingual-cased',args={'num_train_epochs': 4, 'learning_rate': 1e-5, 'fp16': False,
'max_seq_length': 160, 'train_batch_size': 24,'eval_batch_size': 24 ,
'warmup_ratio': 0.0,'weight_decay': 0.00,
'overwrite_output_dir': True})
print("start with fold_{}".format(counter))
trainfolds["{}_fold".format(counter)].to_csv("/content/data/train.tsv", sep="\t", index = False, header=False)
print("{}_fold Train als train.tsv exportiert". format(counter))
testfolds["{}_fold".format(counter)].to_csv("/content/data/dev.tsv", sep="\t", index = False, header=False)
print("{}_fold test als train.tsv exportiert". format(counter))
train_df = pd.read_csv("/content/data/train.tsv", delimiter='\t', header=None)
eval_df = df = pd.read_csv("/content/data/dev.tsv", delimiter='\t', header=None)
train_df = pd.DataFrame({
'text': train_df[3].replace(r'\n', ' ', regex=True),
'label':train_df[1]})
eval_df = pd.DataFrame({
'text': eval_df[3].replace(r'\n', ' ', regex=True),
'label':eval_df[1]})
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1 = sklearn.metrics.f1_score)
print(result)
resultatos.append(result)
shutil.rmtree("outputs")
shutil.rmtree("cache_dir")
#shutil.rmtree("runs")
counter += 1
And i get different Results Running this code for the same Folds:
Here for example the F1 Scores for two runs:
0.6237942122186495
0.6189111747851003
0.6172839506172839
0.632183908045977
0.6182965299684542
0.5942492012779553
0.6025641025641025
0.6153846153846154
0.6390532544378699
0.6627906976744187
The F1 Score is: 0.6224511646974427
0.6064516129032258
0.6282420749279539
0.6402439024390244
0.5971014492753622
0.6135693215339232
0.6191950464396285
0.6382978723404256
0.6388059701492537
0.6097560975609756
0.5956112852664576
The F1 Score is: 0.618727463283623
How can they be that diffeerent for the same folds?
What i tried already is give a fixed Random seed right before my loop starts:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
I came up with approach of having the Model initialized in the loop because, when its outside the loop, it somehow remembers what it has learned - that means after the 2nd fold I get f1 score of almost one - despite the fact that i delete the cache..
I figured it out myself, just set all seeds plus torch.backends.cudnn.deterministic = True
and
torch.backends.cudnn.benchmark = False
like shown in this post and i get the same results for all runs!

Semantic similarity to compare two columns in data frames using sklearn

i face an issue to pass a function to compare between two column
import nltk, string
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=normalize, stop_words='english')
def cosine_sim1(text1, text2):
tfidf = vectorizer.fit_transform([text1, text2])
return ((tfidf * tfidf.T).A)[0,1]
after i apply the function
cosine_sim1('like football', 'football')
The results is:
0.5797386715376657
I face a little issue to pass that function between two column in dataframe to calculate the score. here is a small sample of the data
d = pd.DataFrame({'A': ['my name is', 'i live in', 'i like football'], 'B': ['london is nice city', 'london city', 'football']})
i have tried to do like that. However there are some errors appears.
def cosine_sim1(text1, text2):
tfidf = vectorizer.fit_transform([text1(d['A']), text2(d['B'])])
return ((tfidf * tfidf.T).A)[0,1]
d.apply(cosine_sim1, axis=1)
The error is:
TypeError: ("cosine_sim1() missing 1 required positional argument: 'text2'", 'occurred at index 0')
I believe it should be
def cosine_sim1(text1, text2):
tfidf = vectorizer.fit_transform([text1, text2])
return ((tfidf * tfidf.T).A)[0,1]
d.apply(lambda x: cosine_sim1(x.A, x.B), axis=1)
You are applying function to DataFrame but you are not passing the parameters that you have defined.

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference in the 2 datasets is the fact that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign) and both the labels are integers wheres in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica') and all these 3 labels are in string type.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
if len(data) >= k:
warnings.warn('K is set to a value less than total voting groups!')
distances = []
for group in data:
for features in data[group]:
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
distances.append([euclidean_distance, group])
votes = [i[1] for i in sorted(distances)[:k]]
vote_result = Counter(votes).most_common(1)[0][0]
return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
train_set[i[-1]].append(i[:-1])
for i in test_data:
test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
for data in test_set[group]:
vote = k_nearest_neighbors(train_set, data, k=5)
if group == vote:
correct += 1
total += 1
print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breas-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discuss before, you just need to avoid hard-coding the names of the classes in your code
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
classes_seen = {}
for i in data:
_class = data[-1]
if _class not in classes_seen:
classes_seen[_class] = len(classes_seen)
return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
% read the csv file
df = pd.read_csv('iris.csv')
% clean the data file
df.replace('?', -99999, inplace=True)
% convert the string classes into integer types.
% integers are assigned from 0 to N-1.
% species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
% convert the data frame to list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
Turns out that we need not use the above piece of code also, i.e I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the numbers have to be given as integers and not string (otherwise it would lead to key error!).
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way to go about it is that you don't convert the data into floating point values and then you won't get this error. However in many cases you need to convert all the data into floating point (for eg.. normalisation, accuracy, long mathematical calculations, prevention of loss of precision etc etc..).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
if len(data) >= k:
warnings.warn('K is set to a value less than total voting groups!')
distances = []
for group in data:
for features in data[group]:
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
distances.append([euclidean_distance, group])
votes = [i[1] for i in sorted(distances)[:k]]
vote_result = Counter(votes).most_common(1)[0][0]
return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
train_set[i[-1]].append(i[:-1])
for i in test_data:
test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
for data in test_set[group]:
vote = k_nearest_neighbors(train_set, data, k=5)
if group == vote:
correct += 1
total += 1
print('Accuracy : ', (correct/total)*100,'%')
Hope this helps!

Visualise word2vec generated from gensim using t-sne

I have trained a doc2vec and corresponding word2vec on my own corpus using gensim. I want to visualise the word2vec using t-sne with the words. As in, each dot in the figure has the "word" also with it.
I looked at a similar question here : t-sne on word2vec
Following it, I have this code :
import gensim
import gensim.models as g
from sklearn.manifold import TSNE
import re
import matplotlib.pyplot as plt
modelPath="/Users/tarun/Desktop/PE/doc2vec/model3_100_newCorpus60_1min_6window_100trainEpoch.bin"
model = g.Doc2Vec.load(modelPath)
X = model[model.wv.vocab]
print len(X)
print X[0]
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[:1000,:])
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
This gives a figure with dots but no words. That is I don't know which dot is representative of which word. How can I display the word with the dot?
Two parts to the answer: how to get the word labels, and how to plot the labels on a scatterplot.
Word labels in gensim's word2vec
model.wv.vocab is a dict of {word: object of numeric vector}. To load the data into X for t-SNE, I made one change.
vocab = list(model.wv.key_to_index)
X = model.wv[vocab]
This accomplishes two things: (1) it gets you a standalone vocab list for the final dataframe to plot, and (2) when you index model, you can be sure that you know the order of the words.
Proceed as before with
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
Now let's put X_tsne together with the vocab list. This is easy with pandas, so import pandas as pd if you don't have that yet.
df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])
The vocab words are the indices of the dataframe now.
I don't have your dataset, but in the other SO you mentioned, an example df that uses sklearn's newsgroups would look something like
x y
politics -1.524653e+20 -1.113538e+20
worry 2.065890e+19 1.403432e+20
mu -1.333273e+21 -5.648459e+20
format -4.780181e+19 2.397271e+19
recommended 8.694375e+20 1.358602e+21
arguing -4.903531e+19 4.734511e+20
or -3.658189e+19 -1.088200e+20
above 1.126082e+19 -4.933230e+19
Scatterplot
I like the object-oriented approach to matplotlib, so this starts out a little different.
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(df['x'], df['y'])
Lastly, the annotate method will label coordinates. The first two arguments are the text label and the 2-tuple. Using iterrows(), this can be very succinct:
for word, pos in df.iterrows():
ax.annotate(word, pos)
[Thanks to Ricardo in the comments for this suggestion.]
Then do plt.show() or fig.savefig(). Depending on your data, you'll probably have to mess with ax.set_xlim and ax.set_ylim to see into a dense cloud. This is the newsgroup example without any tweaking:
You can modify dot size, color, etc., too. Happy fine-tuning!
With the following, you can convert your model to a TSV and then use this page for visualization.
with open(self.word_tensors_TSV, 'bw') as file_vector, open(self.word_meta_TSV, 'bw') as file_metadata:
for word in model.wv.vocab:
file_metadata.write((word + '\n').encode('utf-8', errors='replace'))
vector_row = '\t'.join(str(x) for x in model[word])
file_vector.write((vector_row + '\n').encode('utf-8', errors='replace'))
:)

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Resources