Classify text using NaiveBayesClassifier - python-3.x

I have a text file with a sentence on each line:
eg ""Have you registered your email ID with your Bank Account?"
I want to classify it into interrogative or not. FYI these are sentences from bank websites.
I've seen this answer
with this nltk code block:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
features = {}
for word in nltk.word_tokenize(post):
features['contains({})'.format(word.lower())] = True
return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
So I did some preprocessing to my text file i.e. stemming words, removing stop words etc, to make each sentence into a bag of words. From the code above, I have a trained classifier. How do I implement it on my text file of sentences (either raw or preprocessed)?
Update: here is an example of my text file.

Assuming that you have preprocessed the document data as we discussed, you can do the following:
import nltk
nltk.download('nps_chat')
posts = nltk.corpus.nps_chat.xml_posts()[:10000]
def dialogue_act_features(post):
features = {}
for word in nltk.word_tokenize(post):
features['contains({})'.format(word.lower())] = True
return features
featuresets = [(dialogue_act_features(post.text), post.get('class')) for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(nltk.classify.accuracy(classifier, test_set))
0.668
For your data, you can iterate in your lines and fit, predict:
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(dialogue_act_features(line)))

Doing this for all lines in the text file works:
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(dialogue_act_features(line)))

Related

Text Classification on a custom dataset with spacy v3

I am really struggling to make things work with the new spacy v3 version. The documentation is full. However, I am trying to run a training loop in a script.
(I am also not able to perform text classification training with CLI approach).
Data are publically available here.
import pandas as pd
from spacy.training import Example
import random
TRAIN_DATA = pd.read_json('data.jsonl', lines = True)
nlp = spacy.load('en_core_web_sm')
config = {
"threshold": 0.5,
}
textcat = nlp.add_pipe("textcat", config=config, last=True)
label = TRAIN_DATA['label'].unique()
for label in label:
textcat.add_label(str(label))
nlp = spacy.blank("en")
nlp.begin_training()
# Loop for 10 iterations
for itn in range(100):
# Shuffle the training data
losses = {}
TRAIN_DATA = TRAIN_DATA.sample(frac = 1)
# Batch the examples and iterate over them
for batch in spacy.util.minibatch(TRAIN_DATA.values, size=4):
texts = [nlp.make_doc(text) for text, entities in batch]
annotations = [{"cats": entities} for text, entities in batch]
# uses an example object rather than text/annotation tuple
print(texts)
print(annotations)
examples = [Example.from_dict(a)]
nlp.update(examples, losses=losses)
if itn % 20 == 0:
print(losses)

Retraining pre-trained word embeddings in Python using Gensim

I want to retrain pre-trained word embeddings in Python using Gensim. The pre-trained embeddings I want to use is Google's Word2Vec in the file GoogleNews-vectors-negative300.bin.
Following Gensim's word2vec tutorial, "it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there."
Therefore I can't use the KeyedVectors and for training a model the tutorial suggests to use:
model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
(https://rare-technologies.com/word2vec-tutorial/)
However, when I try this:
from gensim.models import Word2Vec
model = Word2Vec.load('data/GoogleNews-vectors-negative300.bin')
I get an error message:
1330 # Because of loading from S3 load can't be used (missing readline in smart_open)
1331 if sys.version_info > (3, 0):
-> 1332 return _pickle.load(f, encoding='latin1')
1333 else:
1334 return _pickle.loads(f.read())
UnpicklingError: invalid load key, '3'.
I didn't find a way to convert the binary google new file into a text file properly, and even if so I'm not sure whether that would solve my problem.
Does anyone have a solution to this problem or knows about a different way to retrain pre-trained word embeddings?
The Word2Vec.load() method can only load full models in gensim's native format (based on Python object-pickling) – not any other binary/text formats.
And, as per the documentation's note that "it’s not possible to resume training with models generated by the C tool", there's simply not enough information in the GoogleNews raw-vectors files to reconstruct the full working model that was used to train them. (That would require both some internal model-weights, not saved in that file, and word-frequency-information for controlling sampling, also not saved in that file.)
The best you could do is create a new Word2Vec model, then patch some/all of the GoogleNews vectors into it before doing your own training. This is an error-prone process with no real best-practices and many caveats about the interpretation of final results. (For example, if you bring in all the vectors, but then only re-train a subset using only your own corpus & word-frequencies, the more training you do – making the word-vectors better fit your corpus – the less such re-trained words will have any useful comparability to retained untrained words.)
Essentially, if you can look at the gensim Word2Vec source & work-out how to patch-together such a frankenstein-model, it may be appropriate. But there's no built-in support or handy off-the-shelf recipes that make it easy, because it's an inherently murky process.
I have already answered it here .
Save the google news model as text file in wor2vec format using gensim.
Refer this answer to save it as text file
Then try this code .
import os
import pickle
import numpy as np
import gensim
from gensim.models import Word2Vec, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import operator
os.mkdir("model_dir")
# class EpochSaver(CallbackAny2Vec):
# '''Callback to save model after each epoch.'''
# def __init__(self, path_prefix):
# self.path_prefix = path_prefix
# self.epoch = 0
# def on_epoch_end(self, model):
# list_of_existing_files = os.listdir(".")
# output_path = 'model_dir/{}_epoch{}.model'.format(self.path_prefix, self.epoch)
# try:
# model.save(output_path)
# except:
# model.wv.save_word2vec_format('model_dir/model_{}.bin'.format(self.epoch), binary=True)
# print("number of epochs completed = {}".format(self.epoch))
# self.epoch += 1
# list_of_total_files = os.listdir(".")
# saver = EpochSaver("my_finetuned")
# function to load vectors from existing model.
# I am loading glove vectors from a text file, benefit of doing this is that I get complete vocab of glove as well.
# If you are using a previous word2vec model I would recommed save that in txt format.
# In case you decide not to do it, you can tweak the function to get vectors for words in your vocab only.
def load_vectors(token2id, path, limit=None):
embed_shape = (len(token2id), 300)
freqs = np.zeros((len(token2id)), dtype='f')
vectors = np.zeros(embed_shape, dtype='f')
i = 0
with open(path, encoding="utf8", errors='ignore') as f:
for o in f:
token, *vector = o.split(' ')
token = str.lower(token)
if len(o) <= 100:
continue
if limit is not None and i > limit:
break
vectors[token2id[token]] = np.array(vector, 'f')
i += 1
return vectors
# path of text file of your word vectors.
embedding_name = "word2vec.txt"
data = "<training data(new line separated tect file)>"
# Dictionary to store a unique id for each token in vocab( in my case vocab contains both my vocab and glove vocab)
token2id = {}
# This dictionary will contain all the words and their frequencies.
vocab_freq_dict = {}
# Populating vocab_freq_dict and token2id from my data.
id_ = 0
training_examples = []
file = open("{}".format(data),'r', encoding="utf-8")
for line in file.readlines():
words = line.strip().split(" ")
training_examples.append(words)
for word in words:
if word not in vocab_freq_dict:
vocab_freq_dict.update({word:0})
vocab_freq_dict[word] += 1
if word not in token2id:
token2id.update({word:id_})
id_ += 1
# Populating vocab_freq_dict and token2id from glove vocab.
max_id = max(token2id.items(), key=operator.itemgetter(1))[0]
max_token_id = token2id[max_id]
with open(embedding_name, encoding="utf8", errors='ignore') as f:
for o in f:
token, *vector = o.split(' ')
token = str.lower(token)
if len(o) <= 100:
continue
if token not in token2id:
max_token_id += 1
token2id.update({token:max_token_id})
vocab_freq_dict.update({token:1})
with open("vocab_freq_dict","wb") as vocab_file:
pickle.dump(vocab_freq_dict, vocab_file)
with open("token2id", "wb") as token2id_file:
pickle.dump(token2id, token2id_file)
# converting vectors to keyedvectors format for gensim
vectors = load_vectors(token2id, embedding_name)
vec = KeyedVectors(300)
vec.add(list(token2id.keys()), vectors, replace=True)
# setting vectors(numpy_array) to None to release memory
vectors = None
params = dict(min_count=1,workers=14,iter=6,size=300)
model = Word2Vec(**params)
# using build from vocab to build the vocab
model.build_vocab_from_freq(vocab_freq_dict)
# using token2id to create idxmap
idxmap = np.array([token2id[w] for w in model.wv.index2entity])
# Setting hidden weights(syn0 = between input layer and hidden layer) = your vectors arranged accoring to ids
model.wv.vectors[:] = vec.vectors[idxmap]
# Setting hidden weights(syn0 = between hidden layer and output layer) = your vectors arranged accoring to ids
model.trainables.syn1neg[:] = vec.vectors[idxmap]
model.train(training_examples, total_examples=len(training_examples), epochs=model.epochs)
output_path = 'model_dir/final_model.model'
model.save(output_path)

determine most similar phrase using word2vec

I try create a model which determine the most similar sentence for another sentence using word2vec.
The idea is to determine the most similar for a sentence, I created an average vector for the words composed this sentence.
Then, I should to predict the most similar sentence using embedding words.
My question is: How can I determine the best similar target sentence after created an average vector of source sentence?
Here the code :
import gensim
from gensim import utils
import numpy as np
import sys
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk import download
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
download('punkt') #tokenizer, run once
download('stopwords') #stopwords dictionary, run once
stop_words = stopwords.words('english')
def preprocess(text):
text = text.lower()
doc = word_tokenize(text)
doc = [word for word in doc if word not in stop_words]
doc = [word for word in doc if word.isalpha()] #restricts string to alphabetic characters only
return doc
############ doc content -> num label -> string label
#note to self: texts[XXXX] -> y[XXXX] = ZZZ -> ng20.target_names[ZZZ]
# Fetch ng20 dataset
ng20 = fetch_20newsgroups(subset='all',
remove=('headers', 'footers', 'quotes'))
# text and ground truth labels
texts, y = ng20.data, ng20.target
corpus = [preprocess(text) for text in texts]
def filter_docs(corpus, texts, labels, condition_on_doc):
"""
Filter corpus, texts and labels given the function condition_on_doc which takes
a doc.
The document doc is kept if condition_on_doc(doc) is true.
"""
number_of_docs = len(corpus)
print(number_of_docs)
if texts is not None:
texts = [text for (text, doc) in zip(texts, corpus)
if condition_on_doc(doc)]
labels = [i for (i, doc) in zip(labels, corpus) if condition_on_doc(doc)]
corpus = [doc for doc in corpus if condition_on_doc(doc)]
print("{} docs removed".format(number_of_docs - len(corpus)))
return (corpus, texts, labels)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: (len(doc) != 0))
def document_vector(word2vec_model, doc):
# remove out-of-vocabulary words
#print("doc:")
#print(doc)
doc = [word for word in doc if word in word2vec_model.vocab]
return np.mean(word2vec_model[doc], axis=0)
def has_vector_representation(word2vec_model, doc):
"""check if at least one word of the document is in the
word2vec dictionary"""
return not all(word not in word2vec_model.vocab for word in doc)
corpus, texts, y = filter_docs(corpus, texts, y, lambda doc: has_vector_representation(model, doc))
x =[]
for doc in corpus: #look up each doc in model
x.append(document_vector(model, doc))
X = np.array(x) #list to array
model.most_similar(positive=X, topn=1)
Just use the cosine distance. It's implemented in scipy.
For better efficiency, you can implement it yourself and precompute the norms of vectors in X:
X_norm = np.linalg.norm(X, axis=1).expand_dims(0)
Calling expand_dims ensures that dimensions got broadcasted. Then for vectors Y, you can get the most similar, you can get the most similar:
def get_most_similar_in_X(Y):
Y_norm = np.linalg.norm(Y, axis=1).expand_dims(1)
similarities = np.dot(Y, X.T) / Y_norm / X_norm
return np.argmax(distances, axis=2)
And you get indices of vectors in X that are most similar to vectors in Y.

How to do Text classification using word2vec

I want to perform text classification using word2vec.
I got vectors of words.
ls = []
sentences = lines.split(".")
for i in sentences:
ls.append(i.split())
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
vectors.append(model[word].tolist())
data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can i perform classification (product & non product)?
You already have the array of word vectors using model.wv.syn0. If you print it, you can see an array with each corresponding vector of a word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())
train = []
#getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
train.extend(sentences)
# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model, in this example we will calculate the LogisticRegression.
# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)
# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)
Y_dataset = []
# get the last number of each file. In this case is the department number
# this will be the 0 or 1, or another kind of classification. ( to use words you need to extract them differently, this way is to numbers)
with open("dbSubset.csv", "r") as f:
for line in f:
lastchar = line.strip()[-1]
if lastchar.isdigit():
result = int(lastchar)
Y_dataset.append(result)
else:
result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.
Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid that the length of the document influences what this vector represents. You could for example choose the mean.
def document_vector(array_of_word_vectors):
return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play a bit around with distances (for example cosine distance would a nice first choice) and see how far certain documents are from each other or - and that's probably the approach that brings faster results - you can use the document vectors to build a training set for a classification algorithm of your choice from scikit learn, for example Logistic Regression.
The document vectors will become your matrix X and your vector y is an array of 1 and 0, depending on the binary category that you want the documents to be classified into.

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Resources