sklearn Cross validation score gives the same results for every number of folds - python-3.x

I can't figure out why cross validation always gives me the same accuracy (0.92), no matter how many folds I use.
Even when I delete the parameter cv=10 it gives me the same result.
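import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
#read preprocessed data
traindata = ast.literal_eval(open('pretprocesirano.txt').read())
testdata = ast.literal_eval(open('pretprocesiranoTEST.txt').read())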
#create word vector
vectorizer= CountVectorizer(tokenizer=lambda x:x.split(), min_df=3, max_features=300)
traindataCV=vectorizer.fit_transform(traindata)
#save wordlist
wordlist=vectorizer.vocabulary_
#save vectorizer
SavedVectorizer = CountVectorizer(vocabulary=wordlist)
#transform test data
testdataCV=SavedVectorizer.transform(testdata)
#modeling-NaiveBayes
clf = MultinomialNB()
clf.fit(traindataCV, label_train)
#cross validation score
CrossValScore = cross_val_score(clf, traindataCV, label_train, cv=10)
print("Accuracy CrossValScore: %0.3f" %CrossValScore.mean())
I tried it this way too, and I got the same result (0.92). This happens even when I change the number of folds, or remove the parameter.
from sklearn.model_selection import KFold
# random_state has no effect when shuffle=False (recent scikit-learn versions reject the combination)
CrossValScore = cross_val_score(clf, traindataCV, label_train, cv=KFold(n_splits=10, shuffle=False))
print("Accuracy CrossValScore: %0.3f" %CrossValScore.mean())
Here are some samples:
traindata= ['ucg investment bank studying unicredit intesa paschi merger sole', 'mtoken sredstva autentifikacije intesa line umesto mini cda cega line vise moze koristi aktivacija', 'pll intesa', 'intesa and unicredit banka asset management the leading italia lenders are both after more fee income but url', 'about write intesa scene colbie cailat fosterthepeople that involves sexy taj between these url']
testdata= ['naumovic samo privilegovani nije delatnosti moci imati hit nama traziti depozit rimuje mentionpositive', 'breaking unicredit board okays launch bad loans vehicle with intesa kkr read more url', 'postoji promocija kupovina telefon rate telefon banka popust pretplata url', 'direktor politike haha struja obecao stan svi zaposliti kredit komercijalna banka', 'forex update unicredit and intesa pool bln euros bad loans kkr vehicle url']
label_train=[0, 1, 0, 0, 0]
label_test=[1, 0, 1, 1, 0]

Related

Python: How to print a for loop with multiple arguments for a numpy.int64?

I want to print the samples from the classification that has been labeled wrong.
I found this code from Sklearn SVM - how to get a list of the wrong predictions?
for idx, input, prediction, label in zip(enumerate(X_test), X_test, predicted, y_test):
    print("No.", idx[0], 'input,',input, ', has been classified as', prediction, 'and should be', label)
I get this TypeError: 'numpy.int64' object is not iterable
My data consists of text data (emails) from folders that are converted to numeric tf-idf vectors. About 250 files have been misclassified, and I want to list them in order to take a closer look at them.
Please help me to find a way to list these misclassifications.
The data consists of more than 4000 emails like this:
Email[X_test]:
messageid 14149441075861143483javamailevansthyme date thu 13 dec 2001 051749 0800 pst from staylorsdecom to teblokeyenroncom subject flight mimeversion 10 contenttype textplain charsetusascii contenttransferencoding 7bit xfrom taylor sandy staylorsdecom xto teblokeyenroncom xcc xbcc xfolder teblokeymar2002lokey tebinbox xorigin lokeyt xfilename tlokey nonprivilegedpst ive made a tentative reservation on continental to leave thursday dec 20 at 550 pm stop in cleveland no change and arrive in houston at 1033 pm return first class on sunday dec 30 at 1050 am change in cleveland and arrive manchester at 503 pm how about making a reservation to fly back with me you can always cancel and return whenever it cheers me just to know that youd even consider this you need the break and i could use the company i know youd love deb friendtenant and she you and ditto for gracie let me know your thoughts love sandy
And after it is transformed with TfidfVectorizer() and todense(), the email looks like this.
X_test[example]:
[[0. 0. 0.03120722 ... 0. 0. 0. ]]
The values represent the tf-idf weights.
type of X_test: <class 'numpy.matrix'>
(4519, 115674)
4519: number of emails within X_test
115674: number of features (unique terms)
The emails are labeled as phish (1) or legit (0).
#Fit model to data
model = LogisticRegression()
model.fit(X_train, y_train)
# make predictions
expected = y_test
predicted = model.predict(X_test)
proba = model.predict_proba(X_test)
# Scores
accuracy = accuracy_score(expected, predicted)
recall = recall_score(expected, predicted, average="binary")
precision = precision_score(expected, predicted , average="binary")
f1 = f1_score(expected, predicted , average="binary")
# Confusion matrix
cm = metrics.confusion_matrix(expected, predicted)
print(cm)
This is when I want to list the misclassifications from X_test.
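One possible way to list the misclassifications is to compare the predictions with the true labels element by element and keep the positions where they differ. The sketch below uses small made-up arrays in place of the question's data; `texts_test` (the raw emails in the same order as `X_test`) is a hypothetical name that does not appear in the question.

```python
import numpy as np

# Toy stand-ins for the question's variables (hypothetical values).
texts_test = ["email a", "email b", "email c", "email d"]  # raw emails, same order as X_test
expected = np.array([0, 1, 1, 0])    # true labels (y_test)
predicted = np.array([0, 0, 1, 1])   # output of model.predict(X_test)

# Positions where the prediction disagrees with the true label.
wrong_idx = np.flatnonzero(predicted != expected)

for i in wrong_idx:
    print("No.", i, "input:", texts_test[i],
          "has been classified as", predicted[i],
          "and should be", expected[i])
```

If `y_test` is a pandas Series, converting it with `np.asarray(y_test)` first keeps the comparison positional rather than index-based.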

I have a dataset with one target column and two text columns. It is an NLP problem which I am trying to solve through deep learning

I am dealing with a dataset that has 3 fields. One field is my target and the other two are text fields. It is basically an NLP problem. I am trying a deep learning approach, but when I take both text fields into account I get an error while tokenizing X_train after the train/test split.
I have already read the dataset and label encoded the target column. I have cleaned up the text columns and used a stemmer to reduce the words to their stems. I have stored the two text columns in X and the target column in y. Then I performed a train/test split. After that I try to tokenize X_train, which gives me an error. Review Text and Review Title are the text columns.
df=pd.read_csv('train_amazon.csv')
df.head(10)
df['topic'].nunique()
df['topic'].value_counts()
df['Review Text'].isnull().any()
df['Review Title'].isnull().any()
df['topic'].isnull().any()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['topic'] = le.fit_transform(df['topic'])
df.head()
le.classes_
dummy_y = pd.get_dummies(df['topic']).values
X =df.iloc[:, :-1].values
y = dummy_y
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 101)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
vocabulary_size = len(tokenizer.word_index) + 1
vocabulary_size
I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-67-7ab7cb886988> in <module>
1 tokenizer = Tokenizer()
----> 2 tokenizer.fit_on_texts(X_train)
3 vocabulary_size = len(tokenizer.word_index) + 1
4 vocabulary_size
~\Anaconda3\lib\site-packages\keras_preprocessing\text.py in fit_on_texts(self, texts)
221 self.filters,
222 self.lower,
--> 223 self.split)
224 for w in seq:
225 if w in self.word_counts:
~\Anaconda3\lib\site-packages\keras_preprocessing\text.py in text_to_word_sequence(text, filters, lower, split)
41 """
42 if lower:
---> 43 text = text.lower()
44
45 if sys.version_info < (3,):
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
My X_train has shape (4469, 2)
and looks like this:
array([['use sinc seal miss', 'broken seal'],
['took week immedi effect 1 2 hour hour ingest includ tingl extrem slight relax probabl help anxieti much like medic numb make care less what is bother you howev product made difficult focus short term memori sever impact bout week stuff good detail orient job trust take longer sinc long term effect like unknown care take unregul supplements!!!!',
'careless'],
['smell aw mean rancid could make sick sooooo annoy wish could money back',
'rancid pill'],
...,
['didn t realiz serv size capsul purchas huge deal fault prefer take pill vitamin idea it s work help',
'vitamin yeah'],
['horribl taste! wast money', 'horribl fake tast'],
['nasti stuff work with thick dropper doesn t work well finger bottl leav sticki mess don t lick bitter',
'nasti']], dtype=object)
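The traceback shows `text.lower()` being called on a `numpy.ndarray`: each element that `fit_on_texts` iterates over is a whole row of `X_train` (two strings), not a single string. A minimal sketch of one way around this, assuming the two columns can simply be concatenated per review, is to turn `X_train` into a flat list of strings first. The two rows below are copied from the question; `train_texts` and `flat_texts` are hypothetical names.

```python
import numpy as np

# Toy stand-in for X_train: shape (n_samples, 2), dtype=object.
X_train = np.array([['use sinc seal miss', 'broken seal'],
                    ['horribl taste! wast money', 'horribl fake tast']],
                   dtype=object)

# Option 1: join the two text columns of each row into a single string.
train_texts = [" ".join(row) for row in X_train]

# Option 2: treat every cell as its own text.
flat_texts = X_train.ravel().tolist()

print(train_texts)
# train_texts is a plain list of str, so it can be passed on as
# tokenizer.fit_on_texts(train_texts)
```

Whether to join the columns or keep them as separate inputs is a modelling decision; joining is only the simplest option.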

How can I calculate perplexity using nltk

I am trying to process some text. This is part of my code:
with open(train_file) as fp:
    raw = fp.read()
sents = raw.splitlines()  # calling readlines() after read() would return an empty list
words = nltk.tokenize.word_tokenize(raw)
bigrams = ngrams(words, 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')
fdist = nltk.FreqDist(words)
In the old versions of nltk I found this code on StackOverflow for perplexity
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(5, train, estimator=estimator)
print("len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) ))
print("perplexity(test) =", lm.perplexity(test))
However, this code is no longer valid, and I didn't find any other package or function in nltk for this purpose. Should I implement it?
Perplexity
Let's assume we have a model that takes an English sentence as input and outputs a probability score for how likely it is to be a valid English sentence. We want to determine how good this model is. A good model should give a high score to valid English sentences and a low score to invalid ones. Perplexity is a widely used measure of how "good" such a model is. If a sentence $s$ contains $n$ words $w_1, \dots, w_n$, then its perplexity is
$$PP(s) = p(w_1, \dots, w_n)^{-1/n}$$
Modeling probability distribution p (building the model)
$p(s) = p(w_1, \dots, w_n)$ can be expanded using the chain rule of probability:
$$p(w_1, \dots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \dots, w_{n-1})$$
So given some data (called train data) we could, in principle, estimate the conditional probabilities above. In practice this is not possible because it would require a huge amount of training data. We therefore make a simplifying assumption.
Assumption : All words are independent (unigram)
$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i)$$
Assumption : First order Markov assumption (bigram)
The next word depends only on the previous word
$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_{i-1})$$
Assumption : n order Markov assumption (ngram)
The next word depends only on the previous n words
$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_{i-n}, \dots, w_{i-1})$$
MLE to estimate probabilities
Maximum Likelihood Estimation (MLE) is one way to estimate the individual probabilities
Unigram
$$p(w) = \frac{count(w)}{N}$$
where
count(w) is the number of times the word w appears in the train data
N is the total number of words (tokens) in the train data, not the number of unique words.
Bigram
$$p(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$
where
count(w_{i-1}, w_i) is the number of times the words w_{i-1}, w_i appear together in that order (bigram) in the train data
count(w_{i-1}) is the number of times the word w_{i-1} appears in the train data. w_{i-1} is called the context.
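Calculating Perplexity
As we have seen above, $p(s)$ is calculated by multiplying many small numbers, so it is not numerically stable because of the limited precision of floating-point numbers on a computer. Let's use the nice properties of log to simplify it. We know
$$PP(s) = p(s)^{-1/n} = 2^{-\frac{1}{n} \log_2 p(s)} = 2^{-l}, \qquad l = \frac{1}{n} \sum_{i=1}^{n} \log_2 p(w_i)$$
where $p(w_i)$ is replaced by the conditional probability $p(w_i \mid w_{i-1})$ for a bigram model.
Example: Unigram model
Train Data: ["an apple", "an orange"]
Vocabulary : [an, apple, orange, UNK]
MLE estimates
p(an) = 2/4 = 0.5, p(apple) = 1/4 = 0.25, p(orange) = 1/4 = 0.25, p(UNK) = 0
For test sentence "an apple"
l = (np.log2(0.5) + np.log2(0.25))/2 = -1.5
np.power(2, -l) = 2.8284271247461903
For test sentence "an ant"
l = (np.log2(0.5) + np.log2(0))/2 = -inf
np.power(2, -l) = inf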
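The unigram arithmetic above can be checked without NLTK. The sketch below is an illustration in plain Python, not the NLTK implementation; `unigram_prob` and `perplexity` are hypothetical helper names, and sentences are split on whitespace for simplicity.

```python
import math
from collections import Counter

train_sentences = ["an apple", "an orange"]
tokens = [w for s in train_sentences for w in s.split()]
counts = Counter(tokens)
total = len(tokens)  # 4 tokens in the training data

def unigram_prob(word):
    # MLE estimate: relative frequency; unseen words get probability 0.
    return counts[word] / total

def perplexity(sentence):
    words = sentence.split()
    log_sum = 0.0
    for w in words:
        p = unigram_prob(w)
        if p == 0.0:
            return math.inf  # log2(0) is -inf, so the perplexity is infinite
        log_sum += math.log2(p)
    l = log_sum / len(words)  # average log2 probability per word
    return 2 ** (-l)

pp_apple = perplexity("an apple")
pp_ant = perplexity("an ant")
print(pp_apple, pp_ant)
```

The infinite perplexity for "an ant" is the reason unsmoothed MLE models are rarely used on text containing unseen words.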
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                  for sent in train_sentences]
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)
test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                  for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])
# test_data is a generator and was consumed above, so build it again
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))
Example: Bigram model
Train Data: "an apple", "an orange"
Padded Train Data: "<s> an apple </s>", "<s> an orange </s>"
Vocabulary : <s>, </s>, an, apple, orange, UNK
MLE estimates
p(an|<s>) = 2/2 = 1, p(apple|an) = 1/2 = 0.5, p(orange|an) = 1/2 = 0.5, p(</s>|apple) = 1, p(</s>|orange) = 1
For test sentence "an apple" Padded : "<s> an apple </s>"
l = (np.log2(p(an|<s>)) + np.log2(p(apple|an)) + np.log2(p(</s>|apple)))/3 =
(np.log2(1) + np.log2(0.5) + np.log2(1))/3 = -0.3333
np.power(2, -l) = 1.2599
For test sentence "an ant" Padded : "<s> an ant </s>"
l = (np.log2(p(an|<s>)) + np.log2(p(ant|an)) + np.log2(p(</s>|ant)))/3 = -inf
np.power(2, -l) = inf
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Vocabulary
train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]
n = 2
train_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)
test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])
# the bigram generators were consumed above, so build them again
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings which I want to (1) cluster and (2) use that model to predict which cluster a new string belongs to.
Running the first part works fine, getting a prediction for the new string does not.
First Part
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
# List of
documents_lst = ['a small, narrow river',
'a continuous flow of liquid, air, or gas',
'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
'a group in which schoolchildren of the same age and ability are taught',
'(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
'put (schoolchildren) in groups of the same age and ability to be taught together',
'a natural body of running water flowing on or under the earth']
# 1. Vectorize the text
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3
# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)
clusters = km.labels_.tolist()
print(clusters)
Which returns:
tfidf_matrix.shape: (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
Second Part
The failing part:
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)
km.predict(tfidf_matrix)
The error:
ValueError: Incorrect number of features. Got 7 features, expected 39
FWIW: I somewhat understand that the training and prediction data end up with a different number of features after vectorizing ...
I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.
Thanks in advance
For completeness I will answer my own question with an answer taken from another question. It does not answer that question, but it does answer mine.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]
# charset_error is now called decode_error; an integer min_df must be at least 1
vectorizer = TfidfVectorizer(min_df = 1, max_df=0.5, stop_words = "english", decode_error = "ignore", ngram_range = (1,3))
vec = vectorizer.fit(list1) # train vec using list1
vectorized = vec.transform(list1) # transform list1 using vec
# precompute_distances and n_jobs are no longer accepted by KMeans in current scikit-learn releases
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)
km.fit(vectorized)
list2Vec = vec.transform(list2) # transform list2 using vec
km.predict(list2Vec)
The credit goes to @IrshadBhat
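Applied to the documents from the question, the same idea could look like the sketch below: the vectorizer is fitted once on the training documents and then only `transform()` is called on the new string, so both matrices share the same columns. `n_init=10` and `random_state=0` are assumptions added here to make the run repeatable; they are not in the question.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents_lst = [
    'a small, narrow river',
    'a continuous flow of liquid, air, or gas',
    'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
    'a group in which schoolchildren of the same age and ability are taught',
    '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
    'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
    'put (schoolchildren) in groups of the same age and ability to be taught together',
    'a natural body of running water flowing on or under the earth',
]
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Fit the vectorizer once, on the training documents only.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(tfidf_matrix)

# Reuse the fitted vectorizer: transform(), not fit_transform().
new_matrix = tfidf_vectorizer.transform(predict_doc)
predicted_cluster = km.predict(new_matrix)
print(new_matrix.shape, predicted_cluster)
```

Words in the new string that were not seen during fitting are silently ignored by `transform()`, which is why the feature count stays the same.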

How to predict Label of an email using a trained NB Classifier in sklearn?

I have created a Gaussian Naive Bayes classifier on an email (spam/not spam) dataset and was able to run it successfully. I vectorized the data, divided it into train and test sets, and then calculated the accuracy, using the features available in sklearn's Gaussian Naive Bayes classifier.
Now I want to use this classifier to predict "labels" for new emails, i.e. whether they are spam or not.
For example, say I have an email. I want to feed it to my classifier and get a prediction as to whether it is spam or not. How can I achieve this? Please help.
Code for classifier file.
#!/usr/bin/python
import sys
from time import time
import logging
# Display progress logs on stdout
logging.basicConfig(level = logging.DEBUG, format = '%(asctime)s %(message)s')
sys.path.append("../DatasetProcessing/")
from vectorize_split_dataset import preprocess
### features_train and features_test are the features for the training and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
#########################################################
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
t0 = time()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
print("training time:", round(time() - t0, 3), "s")
print(clf.score(features_test, labels_test))
## Printing Metrics for Training and Testing
print("No. of Testing Features:" + str(len(features_test)))
print("No. of Testing Features Label:" + str(len(labels_test)))
print("No. of Training Features:" + str(len(features_train)))
print("No. of Training Features Label:" + str(len(labels_train)))
print("No. of Predicted Features:" + str(len(pred)))
## Calculating Classifier Performance
from sklearn.metrics import classification_report
y_true = labels_test
y_pred = pred
labels = ['0', '1']
target_names = ['class 0', 'class 1']
print(classification_report(y_true, y_pred, target_names = target_names, labels = labels))
# How to predict label of a new text
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
Code for Vectorization
#!/usr/bin/python
import os
import pickle
import numpy
numpy.random.seed(42)
path = os.path.dirname(os.path.abspath(__file__))
### The words (features) and label_data (labels), already largely processed.
### These files should have been created beforehand
feature_data_file = os.path.join(path, "createdDataset", "dataSet.pkl")
label_data_file = os.path.join(path, "createdDataset", "dataLabel.pkl")
feature_data = pickle.load(open(feature_data_file, "rb"))
label_data = pickle.load(open(label_data_file, "rb"))
### test_size is the percentage of events assigned to the test set
### (the remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(feature_data, label_data, test_size = 0.1, random_state = 42)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = 'english')
features_train = vectorizer.fit_transform(features_train)
features_test = vectorizer.transform(features_test)#.toarray()
## feature selection to reduce dimensionality
from sklearn.feature_selection import SelectPercentile, f_classif
selector = SelectPercentile(f_classif, percentile = 5)
selector.fit(features_train, labels_train)
features_train_transformed_reduced = selector.transform(features_train).toarray()
features_test_transformed_reduced = selector.transform(features_test).toarray()
features_train = features_train_transformed_reduced
features_test = features_test_transformed_reduced
def preprocess():
    return features_train, features_test, labels_train, labels_test
Code for dataset generation
#!/usr/bin/python
import os
import pickle
import re
import sys
# sys.path.append("../tools/")
"""
Starter code to process the texts of accurate and inaccurate category to extract
the features and get the documents ready for classification.
The list of all the texts from accurate category are in the accurate_files list
likewise for texts of inaccurate category are in (inaccurate_files)
The data is stored in lists and packed away in pickle files at the end.
"""
accurate_files = open("./rawDatasetLocation/accurateFiles.txt", "r")
inaccurate_files = open("./rawDatasetLocation/inaccurateFiles.txt", "r")
label_data = []
feature_data = []
### temp_counter is a way to speed up the development--there are
### thousands of lines of accurate and inaccurate text, so running over all of them
### can take a long time
### temp_counter helps you only look at the first 200 lines in the list so you
### can iterate your modifications quicker
temp_counter = 0
for name, from_text in [("accurate", accurate_files), ("inaccurate", inaccurate_files)]:
    for path in from_text:
        ### only look at first 200 texts when developing
        ### once everything is working, remove this line to run over full dataset
        temp_counter = 1
        if temp_counter < 200:
            path = os.path.join('..', path[: -1])
            print(path)
            text = open(path, "r")
            line = text.readline()
            while line:
                ### use a function parseOutText to extract the text from the opened text
                # stem_text = parseOutText(text)
                stem_text = text.readline().strip()
                print(stem_text)
                ### use str.replace() to remove any instances of the words
                # stem_text = stem_text.replace("germani", "")
                ### append the text to feature_data
                feature_data.append(stem_text)
                ### append a 0 to label_data if text is from Sara, and 1 if text is from Chris
                if (name == "accurate"):
                    label_data.append("0")
                elif(name == "inaccurate"):
                    label_data.append("1")
                line = text.readline()
            text.close()
print("texts processed")
accurate_files.close()
inaccurate_files.close()
pickle.dump(feature_data, open("./createdDataset/dataSet.pkl", "wb"))
pickle.dump(label_data, open("./createdDataset/dataLabel.pkl", "wb"))
I also want to know whether I can train the classifier incrementally, that is, retrain an existing model with newer data to refine it over time.
I would be really glad if someone could help me out with this. I am really stuck at this point.
You are already using your model to predict labels of emails in your test set. This is what pred = clf.predict(features_test) does. If you want to see these labels, do print(pred).
But perhaps you want to know how you can predict labels for emails that you discover in the future and that are not currently in your test set? If so, you can think of your new email(s) as a new test set. As with your previous test set, you will need to run several key processing steps on the data:
1) The first thing you need to do is to generate features for your new email data. The feature generation step is not included in your code above, but will need to occur.
2) You are using a Tfidf vectorizer, which converts a collection of documents to a matrix of Tfidf features based upon term frequency and inverse document frequency. You need to put your new email test feature data through the vectorizer that you fit on your training data.
3) Then your new email test feature data will need to go through dimensionality reduction using the same selector that you fit on your training data.
4) Finally, run predict on your new test data. Use print(pred) if you want to view the new label(s).
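The steps above can be sketched end to end as follows. This is a minimal illustration under assumptions, not the asker's actual pipeline: the training texts and labels are made-up toy data, and `percentile=50` replaces the question's `percentile=5` only because the toy vocabulary is tiny. The important part is that the new email goes through `transform()` on the already fitted vectorizer and selector, never `fit()` or `fit_transform()`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy training data; "1" = spam, "0" = not spam.
train_texts = [
    "win lottery prize now",
    "claim your lottery prize",
    "free prize claim now",
    "meeting schedule for monday",
    "project meeting notes attached",
    "monday project schedule update",
]
train_labels = ["1", "1", "1", "0", "0", "0"]

# Fit the vectorizer, selector and classifier on the training data only.
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
features_train = vectorizer.fit_transform(train_texts)
# percentile=50 is a toy choice for this tiny vocabulary.
selector = SelectPercentile(f_classif, percentile=50)
selector.fit(features_train, train_labels)
clf = GaussianNB()
clf.fit(selector.transform(features_train).toarray(), train_labels)

# New email: reuse the fitted objects, calling transform() only.
new_text = "You won a lottery at UK lottery commission. Reply to claim it"
new_features = vectorizer.transform([new_text])              # step 2
new_features = selector.transform(new_features).toarray()    # step 3
new_pred = clf.predict(new_features)                          # step 4
print(new_pred)
```

In the asker's setup this means `preprocess()` would also need to expose the fitted `vectorizer` and `selector` (or they would need to be pickled alongside the model) so that they can be reused at prediction time.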
To respond to your final question about iteratively re-training your model, yes you definitely can do this. It's just a matter of selecting a frequency, producing a script that expands your data set with incoming data, then re-running all steps from there, from pre-processing to Tfidf vectorization, to dimensionality reduction, to fitting, and prediction.
