How can I calculate perplexity using nltk

How can I calculate perplexity using nltk - python-3.x

I try to do some process on a text. It's part of my code:
fp = open(train_file)
raw = fp.read()
sents = fp.readlines()
words = nltk.tokenize.word_tokenize(raw)
bigrams = ngrams(words,2, left_pad_symbol='<s>', right_pad_symbol=</s>)
fdist = nltk.FreqDist(words)
In the old versions of nltk I found this code on StackOverflow for perplexity
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(5, train, estimator=estimator)
print("len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) ))
print("perplexity(test) =", lm.perplexity(test))
However, this code is no longer valid, and I didn't find any other package or function in nltk for this purpose. Should I implement it?

Perplexity
Lets assume we have a model which takes as input an English sentence and gives out a probability score corresponding to how likely its is a valid English sentence. We want to determined how good this model is. A good model should give high score to valid English sentences and low score to invalid English sentences. Perplexity is a popularly used measure to quantify how "good" such a model is. If a sentence s contains n words then perplexity
Modeling probability distribution p (building the model)
can be expanded using chain rule of probability
So given some data (called train data) we can calculated the above conditional probabilities. However, practically it is not possible as it will requires huge amount of training data. We then make assumption to calculate
Assumption : All words are independent (unigram)
Assumption : First order Markov assumption (bigram)
Next words depends only on the previous word
Assumption : n order Markov assumption (ngram)
Next words depends only on the previous n words
MLE to estimate probabilities
Maximum Likelihood Estimate(MLE) is one way to estimate the individual probabilities
Unigram
where
count(w) is number of times the word w appears in the train data
count(vocab) is the number of uniques words (called vocabulary) in the train data.
Bigram
where
count(w_{i-1}, w_i) is number of times the words w_{i-1}, w_i appear together in same sequence (bigram) in the train data
count(w_{i-1}) is the number of times the word w_{i-1} appear in the train data. w_{i-1} is called context.
Calculating Perplexity
As we have seen above $p(s)$ is calculated by multiplying lots of small numbers and so it is not numerically stable because of limited precision of floating point numbers on a computer. Lets use the nice properties of log to simply it. We know
Example: Unigram model
Train Data ["an apple", "an orange"]
Vocabulary : [an, apple, orange, UNK]
MLE estimates
For test sentence "an apple"
l = (np.log2(0.5) + np.log2(0.25))/2 = -1.5
np.power(2, -l) = 2.8284271247461903
For test sentence "an ant"
l = (np.log2(0.5) + np.log2(0))/2 = inf
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
for sent in train_sentences]
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)
test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for test in test_data:
print ("MLE Estimates:", [((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in test])
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for i, test in enumerate(test_data):
print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))
Example: Bigram model
Train Data: "an apple", "an orange"
Padded Train Data: "(s) an apple (/s)", "(s) an orange (/s)"
Vocabulary : (s), (/s) an, apple, orange, UNK
MLE estimates
For test sentence "an apple" Padded : "(s) an apple (/s)"
l = (np.log2(p(an|<s> ) + np.log2(p(apple|an) + np.log2(p(</s>|apple))/3 =
(np.log2(1) + np.log2(0.5) + np.log2(1))/3 = -0.3333
np.power(2, -l) = 1.
For test sentence "an ant" Padded : "(s) an ant (/s)"
l = (np.log2(p(an|<s> ) + np.log2(p(ant|an) + np.log2(p(</s>|ant))/3 = inf
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Vocabulary
train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]
n = 2
train_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)
test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
print ("MLE Estimates:", [((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in test])
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

Related

BERT with WMD distance for sentence similarity

I have tried to calculate the similarity between the two sentences using BERT and word mover distance (WMD). I am unable to find the correct formula for WMD in python. Also tried the WMD python library but it uses the word2vec model for embedding. Kindly help to solve the below problem to get the similarity score using WMD.
sentence_obama = 'Obama speaks to the media in Illinois'
sentence_president = 'The president greets the press in Chicago'
sentence_obama = sentence_obama.lower().split()
sentence_president = sentence_president.lower().split()
#Importing bert for creating an embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
#creating an embedding of both sentences
sentence_embeddings1 = model.encode(sentence_obama)
sentence_embeddings2 = model.encode(sentence_president)
distance = WMD(sentence_embeddings1, sentence_embeddings2)
print(distance)

Generally speaking, Word Mover Distance (based on Earth Mover Distance) requires a representation which each feature is associated with weight (or density). For examples bag-of-word representation of sentences with histogram of words.
Intuitively, EMD measures the cost of moving wights (dirt) in a histogram representation of features knowing the ground distance between each feature. With words as features, word vectors provide a distance measure between words, and then EMD can become WMD with word-histograms.
There are two issues with using WMD on BERT embeddings:
BERT embeddings provide contextual representation of sub-words and the sentence (representation of of a subword changes in different context).
There is no measure of density or weight on words and sub-words other than the attention mask on tokens.
The most simple and effective sentence similarity measure with BERT is based on the distance between [CLS] vectors of two sentences (the first vectors in the last hidden layers: the sentence vectors).
With all that said, I will try to find alternative ways to use WMD using pyemd module as in this Gensim implementation of WMD.
To measure which solution actually works, I will evaluate different solutions on this sentence similarity dataset in English.
import datasets
dataset = datasets.load_dataset('stsb_multi_mt', 'en')
Instead of sentence_transformers module, I use the main huggingface transformers. For simplicity I will use the following function to get tokens and sentence emebdedding for a given string:
from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
def encode(sent):
inp = tokenizer(sent, return_tensors='pt')
out = model(**inp)
out = out.last_hidden_state[0].detach().numpy()
return out
Do not forget to import these modules as well:
import numpy as np
from pyemd import emd
from scipy.spatial.distance import cdist
from scipy.stats import spearmanr
We use cdist to measure vector distances, and Spearman's rank-order correlation (spearmanr) to compare our predicted similarity measure with the human judgments.
true_scores = []
pred_cls_scores = []
for item in tqdm(dataset['test']):
sent1 = encode(item['sentence1'])
sent2 = encode(item['sentence2'])
true_scores.append(item['similarity_score'])
pred_cls_scores.append(cdist(sent1[:1], sent2[:1])[0, 0])
spearmanr(true_scores, pred_cls_scores)
# SpearmanrResult(correlation=-0.737203146420342, pvalue=1.0236865615739037e-236)
Spearman's rho=0.737 is quite high!
The original post proposes to represent sentences with vectors of words based on white-space tokenization, run WMD over such representation. Here is an implementation of WMD based on EMD module similar to Gensim:
def wmdistance(sent1, sent2):
words1 = sent1.split()
words2 = sent2.split()
embs1 = np.array([encode(word)[0] for word in words1])
embs2 = np.array([encode(word)[0] for word in words2])
vocab_freq = Counter(words1 + words2)
vocab_indices = {w:idx for idx, w in enumerate(vocab_freq)}
sent1_indices = [vocab_indices[w] for w in words1]
sent2_indices = [vocab_indices[w] for w in words2]
vocab_len = len(vocab_freq)
# Compute distance matrix.
distance_matrix = np.zeros((vocab_len, vocab_len), dtype=np.double)
distance_matrix[np.ix_(sent1_indices, sent2_indices)] = cdist(embs1, embs2)
if abs((distance_matrix).sum()) < 1e-8:
# `emd` gets stuck if the distance matrix contains only zeros.
logger.info('The distance matrix is all zeros. Aborting (returning inf).')
return float('inf')
def nbow(sent):
d = np.zeros(vocab_len, dtype=np.double)
nbow = [(vocab_indices[w], vocab_freq[w]) for w in sent]
doc_len = len(sent)
for idx, freq in nbow:
d[idx] = freq / float(doc_len) # Normalized word frequencies.
return d
# Compute nBOW representation of documents. This is what pyemd expects on input.
d1 = nbow(words1)
d2 = nbow(words2)
# Compute WMD.
return emd(d1, d2, distance_matrix)
The spearman correlations are positive but not as high as the standard solution above.
pred_wmd_scores = []
for item in tqdm(dataset['test']):
pred_wmd_scores.append(wmdistance(item['sentence1'], item['sentence2']))
spearmanr(true_scores, pred_wmd_scores)
# SpearmanrResult(correlation=-0.4279390535806689, pvalue=1.6453234927014767e-62)
Perhaps, rho=0.428 is not too low for word-vector representations but it is quite low.
There are also other alternative ways to use EMD on [CLS] vectors. In order to run EMD, we need ground distances between features of the vector. So, one alternative solution is to map embeddings onto a new vector space which [CLS] vectors express weight of more meaningful features. For example, we can create a list of sentence vectors as components of the vector space. Then map the sentence vectors onto the component space, where each sentence is represented with a vector of component weight. The distance between components is measurable in the original embedding space:
def emdistance(embs1, embs2, components):
distance_matrix = cdist(components, components, metric='cosine')
sent_vec1 = 1-cdist(components, embs1[:1], metric='cosine')[:, 0]
sent_vec2 = 1-cdist(components, embs2[:1], metric='cosine')[:, 0]
return emd(sent_vec1, sent_vec2, distance_matrix)
Perhaps it is possible for some applications to find defining sentences as components, here I just sample 20 random sentences to test this:
n = 20
indices = np.arange(len(dataset['train']))
np.random.shuffle(indices)
random_sentences = [dataset['train'][int(idx)]['sentence1'] for idx in indices[:n]]
random_components = np.array([encode(sent)[0] for sent in random_sentences])
pred_emd_scores = []
for item in tqdm(dataset['test']):
sent1 = encode(item['sentence1'])
sent2 = encode(item['sentence2'])
pred_emd_scores.append(emdistance(sent1, sent2, random_components))
spearmanr(true_scores, pred_emd_scores)
#SpearmanrResult(correlation=-0.5347151444976767, pvalue=8.092612264709952e-103)
With 20 random sentences as components still rho=0.534 is a better score than bag of word rho=0.428.

Why train_test_split influences the results of regression so drastically?

I am performing some tests on the house prices prediction competition of Kaggle.
For easiness, find here below the complete process to download, pre-process and start predicting using a simple linear regression model :
download the data
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()
saveDir = "data"
if not os.path.exists("data"):
os.makedirs(saveDir)
api.competition_download_files("house-prices-advanced-regression-techniques","data")
print("the following files have been downloaded \n" + '\n'.join('{}'.format(item) for item in os.listdir("data")))
print("they are located in " + saveDir)
get train and test data
train = pd.read_csv(saveDir + r"\train.csv")
test = pd.read_csv(saveDir + r"\test.csv")
xTrain = train.iloc[:,1:-1] # remove id & SalePrice
yTrain = train.iloc[:,-1] # SalePrice
xTest = test.iloc[:,1:] # remove id
split numerical and test data
catData = xTrain.columns[xTrain.dtypes == object]
numData = list(set(xTrain.columns) - set(catData))
print("The number of columns in the original dataframe is " + str(len(xTrain.columns)))
print("The number of columns in the categorical and numerical data dds up to " + str(len(catData)+len(numData)))
Define a cleaning function to deal with NaN / None
def cleanData(data, catData, numData) :
dataClean = data.copy()
# Let's deal with NaN ...
# check where there are NaN in categorical
dataClean[catData].columns[dataClean[catData].isna().any(axis=0)]
# take care that some categorical could be numerics so
# differentiate the two cases
dataTypes = [dataClean.loc[dataClean.loc[:,col].notnull(),col].apply(type).iloc[0] for col in catData] # get the data type for each column
# have to be carefull to not take a data that is NaN or None
# when evaluating its type
from itertools import compress
catDataNum = [True if ((col == float) | (col == int)) else False for col in dataTypes ] # if data type is numeric (float/int), register it
catDataNum = list(compress(catData, catDataNum))
catDataNotNum = list(set(catData)-set(catDataNum))
print("The number of columns in the dataframe is " + str(len(dataClean.columns)))
print("The number of columns in the categorical and numerical data dds up to " +
str(len(catDataNum) + len(catDataNotNum)+len(numData)))
# Check what NA means for each feature ...
# BsmtQual : NA means no basement
# GarageType : NA means no garage
# BsmtExposure : NA means no basement
# Alley : NA means no alley access
# BsmtFinType2 : NA means no basement
# GarageFinish : NA means no garage
# did not check the rest ... I will just replace with a category "No"
# For categorical, NaN values mean the considered feature
# do not exist (this requires dataset analysis as performed above)
dataClean[catDataNotNum] = dataClean[catDataNotNum].fillna(value = 'No')
mean = dataClean[catDataNum].mean()
dataClean[catDataNum] = dataClean[catDataNum].fillna(value = mean)
# for numerical, replace with mean
mean = dataClean[numData].mean()
dataClean[numData] = dataClean[numData].fillna(value = mean)
return dataClean
Perform the cleaning
xTrainClean = cleanData(xTrain, catData, numData)
# check if no NaN or None anymore
if xTrainClean.isna().sum().sum() != 0:
print(xTrainClean.iloc[:,xTrainClean.isna().any(axis=0).values])
else :
print("All good! No more NaN or None in training data!")
# same with test data
# perform the cleaning
xTestClean = cleanData(xTest, catData, numData)
# check if no NaN or None anymore
if xTestClean.isna().sum().sum() != 0:
print(xTestClean.iloc[:,xTestClean.isna().any(axis=0).values])
else :
print("All good! No more NaN or None in test data!")
Pre-process data i.e. one-hot encode the categorical features
import sklearn as sk
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# We would like to perform a linear regression on all data
# but some data are categorical ...
# so first, perform a one-hot encoding on categorical variables
ct = ColumnTransformer(transformers = [("OneHotEncoder", OneHotEncoder(categories='auto', drop=None,
sparse=False, n_values='auto',
handle_unknown = "error"),
catData)],
remainder = "passthrough")
ct = ct.fit(pd.concat([xTrainClean, xTestClean])) # fit on both xTrain & xTest to be sure to have all possible categorical values
# test it separately (.fit(xTrain) / .fit(xTest) and analyze to understand)
# resulting categories and values can be obtained through
# ct.named_transformers_ ['OneHotEncoder'].categories_
xTrainOneHot = ct.transform(xTrainClean)
Split training data into an "internal" training and test sets
xTestOneHotKaggle = xTestOneHot.copy()
from sklearn.model_selection import train_test_split
xTrainInternalOneHot, xTestInternalOneHot, yTrainInternal, yTestInternal = train_test_split(xTrainOneHot, yTrain, test_size=0.5, random_state=42, shuffle = False)
print("The training data now contains " + str(xTrainInternalOneHot.shape[0]) + " samples")
print("The training data now contains " + str(yTrainInternal.shape[0]) + " labels")
print("The test data now contains " + str(xTestInternalOneHot.shape[0]) + " samples")
print("The test data now contains " + str(yTestInternal.shape[0]) + " labels")
Train ... And this is where the funny part is
reg = LinearRegression().fit(xTrainInternalOneHot,yTrainInternal)
yTrainInternalPredict = reg.predict(xTrainInternalOneHot)
yTestInternalPredict = reg.predict(xTestInternalOneHot)
print("The R2 score on training data is equal to " + str(reg.score(xTrainInternalOneHot,yTrainInternal)))
print("The R2 score on the internal test data is equal to " + str(reg.score(xTestInternalOneHot, yTestInternal)))
from sklearn.metrics import mean_squared_log_error
print("Tke Kaggle metric score (RMSLE) on internal training data is equal to " +
str(np.sqrt(mean_squared_log_error(yTrainInternal, yTrainInternalPredict))))
print("Tke Kaggle metric score (RMSLE) on internal test data is equal to " +
str(np.sqrt(mean_squared_log_error(yTestInternal, yTestInternalPredict))))
Question
So with the above process, one will get an error when computing the Kaggle metric i.e. RMLSE because some values are negative. The funny thing is that if I change the test_size parameter from 0.5 to 0.2 then no more negative values. One could understand as more data gets used to train so the model performs better. But if I move it from 0.2 to 0.3 (less dramatic change i.e. ~100 training samples) then the issue of the model predicting negative values appear again.
Two questions :
Is this expected i.e. that the model is so sensitive to
training data ? This is even clearer because if test_size = 0.2
is used with shuffle = False then it works. If used when shuffle =
True, then the model starts predicting negative values.
How to deal with such behavior ? Obviously, this is a very simple
model (no standardization, no scaling, no regularization ...) but I
believe it is interesting to really understand what is going on in
this very simple model.

Is this expected i.e. that the model is so sensitive to training data ? This is even clearer because if test_size = 0.2 is used with shuffle = False then it works. If used when shuffle = True, then the model starts predicting negative values.
For your question, yes this split can matter!
How to deal with such behavior ? Obviously, this is a very simple model (no standardization, no scaling, no regularization ...) but I believe it is interesting to really understand what is going on in this very simple model.
Did you ever hear about cross-validation?
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
The concept is to train your classifier/regression with several datasplits, which always have a different train/test-split to avoid this behavior you are explaning, then you can really judge your prediction quality, as new data could also have several different structures.
So you run severaal Iterations and then judge about the outcome.

How to do Text classification using word2vec

I want to perform text classification using word2vec.
I got vectors of words.
ls = []
sentences = lines.split(".")
for i in sentences:
ls.append(i.split())
model = Word2Vec(ls, min_count=1, size = 4)
words = list(model.wv.vocab)
print(words)
vectors = []
for word in words:
vectors.append(model[word].tolist())
data = np.array(vectors)
data
output:
array([[ 0.00933912, 0.07960335, -0.04559333, 0.10600036],
[ 0.10576613, 0.07267512, -0.10718666, -0.00804013],
[ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
[-0.09893986, 0.01500741, -0.04796079, -0.04447284],
[ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
[ 0.09352681, -0.03864434, -0.01743148, 0.11251986],.....])
How can i perform classification (product & non product)?

You already have the array of word vectors using model.wv.syn0. If you print it, you can see an array with each corresponding vector of a word.
You can see an example here using Python3:
import pandas as pd
import os
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression
#Reading a csv file with text data
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())
train = []
#getting only the first 4 columns of the file
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
train.extend(sentences)
# Create an array of tokens using nltk
tokens = [nl.word_tokenize(sentences) for sentences in train]
Now it's time to use the vector model, in this example we will calculate the LogisticRegression.
# method 1 - using tokens in Word2Vec class itself so you don't need to train again with train method
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)
# method 2 - creating an object 'model' of Word2Vec and building vocabulary for training our model
model = gensim.models.Word2vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....
# The two datasets must be the same size
max_dataset_size = len(model.wv.syn0)
Y_dataset = []
# get the last number of each file. In this case is the department number
# this will be the 0 or 1, or another kind of classification. ( to use words you need to extract them differently, this way is to numbers)
with open("dbSubset.csv", "r") as f:
for line in f:
lastchar = line.strip()[-1]
if lastchar.isdigit():
result = int(lastchar)
Y_dataset.append(result)
else:
result = 40
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])
# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)
You can also calculate the similarity of words belonging to your created model dictionary:
print("\n\nSimilarity value : ",model.wv.similarity('women','men'))
You can find more functions to use here.

Your question is rather broad but I will try to give you a first approach to classify text documents.
First of all, I would decide how I want to represent each document as one vector. So you need a method that takes a list of vectors (of words) and returns one single vector. You want to avoid that the length of the document influences what this vector represents. You could for example choose the mean.
def document_vector(array_of_word_vectors):
return array_of_word_vectors.mean(axis=0)
where array_of_word_vectors is for example data in your code.
Now you can either play a bit around with distances (for example cosine distance would a nice first choice) and see how far certain documents are from each other or - and that's probably the approach that brings faster results - you can use the document vectors to build a training set for a classification algorithm of your choice from scikit learn, for example Logistic Regression.
The document vectors will become your matrix X and your vector y is an array of 1 and 0, depending on the binary category that you want the documents to be classified into.

Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the defintions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and predict have a different amount of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance

For completeness I will answer my own question with an answer from here , that doesn't answer that question. But answers mine
from sklearn.cluster import KMeans
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
vectorizer = TfidfVectorizer(min_df = 0, max_df=0.5, stop_words = "english", charset_error = "ignore", ngram_range = (1,3))
vec = vectorizer.fit(list1) # train vec using list1
vectorized = vec.transform(list1) # transform list1 using vec
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, precompute_distances=True, verbose=0, random_state=None, n_jobs=1)
km.fit(vectorized)
list2Vec = vec.transform(list2) # transform list2 using vec
km.predict(list2Vec)
The credit goes to #IrshadBhat

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on a email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided in it train and test sets and then calculated the accuracy, all the features that are present in the sklearn-Gaussian Naive Bayes classifier.
Now I want to be able to use this classifier to predict "labels" for new emails - whether they are by spam or not.
For example say I have an email. I want to feed it to my classifier and get the prediction as to whether it is a spam or not. How can I achieve this? Please Help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features
for the training and testing datasets, respectively### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics
for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words(features) and label_data(labels), already largely processed.###These files should have been created beforehand
feature_data_file = path + "./createdDataset/dataSet.pkl"
label_data_file = path + "./createdDataset/dataLabel.pkl"
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set(the### remainder go into training)### feature matrices changed to dense representations
for compatibility with### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
""
"
Starter code to process the texts of accuate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"
""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are### thousands of lines of accurate and inaccurate text, so running over all of them### can take a long time### temp_counter helps you only look at the first 200 lines in the list so you### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
for path in from_text: ###only look at first 200 texts when developing### once everything is working, remove this line to run over full dataset
temp_counter = 1
if temp_counter < 200:
path = os.path.join('..', path[: -1])
print(path)
text = open(path, "r")
line = text.readline()
while line: ###use a
function parseOutText to extract the text from the opened text# stem_text = parseOutText(text)
stem_text = text.readline().strip()
print(stem_text)### use str.replace() to remove any instances of the words# stem_text = stem_text.replace("germani", "")### append the text to feature_data
feature_data.append(stem_text)### append a 0 to label_data
if text is from Sara, and 1
if text is from Chris
if (name == "accurate"):
label_data.append("0")
elif(name == "inaccurate"):
label_data.append("1")
line = text.readline()
text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
Also I want to know whether i can incrementally train the classifier meaning thereby that retrain a created model with newer data for refining the model over time?
I would be really glad if someone can help me out with this. I am really stuck at this point.

You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print pred.
But perhaps you what to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print pred if you want to view the new label(s).
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string