Accuracy in naive bayes classification is 100% - python-3.x

I have a classification problem with three classes: A, B, and C. I tried a naive Bayes classifier and got 100% accuracy, which I doubt is correct. My dataset is small, around 350 samples: 140 in class A, 140 in class B, and the rest in class C. Here is the code I used. Can anyone suggest what might be wrong?
import sklearn
from sklearn.metrics import accuracy_score
X = feature_data_frame.values
y = label_data
import sklearn.preprocessing as preprocessing
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.10)
gnb = GaussianNB()
y_pred = gnb.fit(x_train, y_train).predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
Thanks in advance.
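With test_size=.10 the score above comes from only about 35 test samples, so a single split can easily look perfect. A minimal sketch of two sanity checks, assuming synthetic stand-in data from make_classification in place of feature_data_frame and label_data (which are not shown): stratified 5-fold cross-validation, and scoring each feature on its own to spot a column that might encode the label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for feature_data_frame.values / label_data (assumption):
# 350 samples, 3 classes, roughly 140/140/70 as described in the question.
X, y = make_classification(
    n_samples=350, n_features=8, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.4, 0.4, 0.2], random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Accuracy over 5 stratified folds instead of one small 10% split.
scores = cross_val_score(GaussianNB(), X, y, cv=cv, scoring="accuracy")
print("fold accuracies:", np.round(scores, 3))
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))

# Score each feature alone. A column that reaches ~100% by itself may be a
# copy or encoding of the label and deserves a closer look.
single_feature_scores = np.array([
    cross_val_score(GaussianNB(), X[:, [j]], y, cv=cv).mean()
    for j in range(X.shape[1])
])
print("best single-feature accuracy: %.3f" % single_feature_scores.max())
```

If the cross-validated accuracy on the real data is still 100%, a leaking feature (for example an ID or a column derived from the label) is one plausible explanation worth ruling out.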

Related

LogisticRegression classifier

I need to use a logistic regression classifier. My dataset has 2000 rows in each column. This is all my code:
from statistics import mode
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
# Importing the datasets
###Social_Network_Ads
datasets = pd.read_csv('C:/Users/n3.csv',header=None)
X = datasets.iloc[:, 0:5].values
Y = datasets.iloc[:, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
# instantiate the model (using the default parameters)
model = LogisticRegression()
# fit the model with data
model.fit(X_Train, Y_Train)
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
train_acc = model.score(X_Train, Y_Train)
print("The Accuracy for Training Set is {}".format(train_acc*100))
But I got this error:
TypeError: Cannot clone object '<function mode at 0x000000FD6579B9D0>'
(type <class 'function'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
How do I solve this?
Change this line
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
to
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)
You have a simple typo. You meant to pass your estimator (model) to the function, but instead you passed mode, which is imported from statistics. That is why the error says it cannot clone an object of type function: cross_val_predict expects an estimator, and you gave it a function.
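For reference, a minimal runnable sketch of the corrected call. The data here is a synthetic stand-in generated with make_classification (an assumption, since n3.csv is not available), and cv_acc is a name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in for the asker's CSV (assumption): 2000 rows, 5 features.
X, Y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

model = LogisticRegression()

# Pass the estimator instance (model), not statistics.mode.
# cross_val_predict clones and fits the estimator on each fold internally.
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)

cv_acc = accuracy_score(Y_Train, predicted)
print("Cross-validated accuracy: {:.2f}".format(cv_acc * 100))
```

Because cross_val_predict works on clones, model itself is left unfitted by that call; model.fit(X_Train, Y_Train) is still needed before calling model.score or model.predict.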

Confusion Matrix in SkLearn showing error

I am trying to plot a confusion matrix for my classification model on the iris dataset. However, I keep getting an error. I hope someone can guide me. Thanks.
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.metrics import confusion_matrix
def train_and_predict(train_input_features, train_outputs, prediction_features):
    classifier=tree.DecisionTreeClassifier()
    classifier.fit(train_input_features,train_outputs)
    predictions=classifier.predict(prediction_features)
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,test_size=0.3, random_state=0)
y_pred = train_and_predict(X_train, y_train, X_test)
print(confusion_matrix(y_test, predictions))
OUT: NameError: name 'predictions' is not defined
I found out that I needed to move the code inside the function, i.e.:
def train_and_predict(train_input_features, train_outputs, prediction_features):
    classifier=tree.DecisionTreeClassifier()
    classifier.fit(train_input_features,train_outputs)
    predictions=classifier.predict(prediction_features)
    print(predictions)
    print('Confusion matrix\n',confusion_matrix(y_test,classifier.predict(X_test)))
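An alternative to printing inside the function is to return the predictions, since the NameError came from predictions being local to train_and_predict. A minimal sketch of that variant; random_state=0 on the tree and the cm variable are additions for this example, not part of the original code.

```python
from sklearn import datasets, tree
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def train_and_predict(train_input_features, train_outputs, prediction_features):
    # random_state is fixed here only to make the sketch repeatable (assumption).
    classifier = tree.DecisionTreeClassifier(random_state=0)
    classifier.fit(train_input_features, train_outputs)
    # Return the predictions so the caller can use them outside the function.
    return classifier.predict(prediction_features)


iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

y_pred = train_and_predict(X_train, y_train, X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

This keeps the function free of references to the global y_test and X_test, so it can be reused with other data.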

Calculating precision, recall, and F-measure for Logistic Regression classifier

I have a labeled and clean dataset for sentiment analysis, and I used logistic regression for classification. Here is my code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
xl = pd.ExcelFile('d:/data.xlsx')
df3 = xl.parse("Sheet1")
cl_data, sent = df3['Clean-Reviews'].fillna(' '), df3['Sentiment']
sent_train, sent_test, y_train, y_test = train_test_split(cl_data, sent,
test_size=0.25, random_state=1000)
vectorizer = CountVectorizer()
vectorizer.fit(sent_train)
X_train = vectorizer.transform(sent_train)
X_test = vectorizer.transform(sent_test)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
When I try to calculate precision, recall, and F-measure:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
print(f1_score(X_test, y_test, average="macro"))
print(precision_score(X_test, y_test, average="macro"))
print(recall_score(X_test, y_test, average="macro"))
I got an error:
TypeError: len() of unsized object
Can anyone tell me what the problem is here? Thanks in advance.
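These metrics compare predicted labels with true labels, and in your code X_test is the feature matrix, not a prediction. Predict first, then pass y_test and y_pred:
y_pred = classifier.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))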
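The same fix as a self-contained sketch. The tiny corpus and its labels are made up for illustration (an assumption, since the asker's Excel file is not available); the point is only the argument order: true labels first, predicted labels second.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up stand-in for the 'Clean-Reviews' / 'Sentiment' columns (assumption).
train_texts = ["good movie", "great film", "loved it",
               "bad movie", "terrible film", "hated it"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great movie", "terrible movie", "loved film", "hated film"]
y_test = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

classifier = LogisticRegression()
classifier.fit(X_train, train_labels)

# Predict labels from the test features, then score labels against labels.
y_pred = classifier.predict(X_test)

precision = precision_score(y_test, y_pred, average="macro", zero_division=0)
recall = recall_score(y_test, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

zero_division=0 only silences the warning when a class is never predicted; it can be dropped if that warning is useful.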

How to compute accuracy and the confusion matrix using K-fold cross-validation?

I tried to do K-fold cross-validation with K=30 folds, producing one confusion matrix per fold. How do I compute the accuracy and the confusion matrix for the model, with a confidence interval?
Could someone help me?
My code is:
import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression
UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns.isin(('sload','dload',
'spkts','dpkts','swin','dwin','smean','dmean',
'sjit','djit','sinpkt','dinpkt','tcprtt','synack','ackdat','ct_srv_src','ct_srv_dst','ct_dst_ltm',
'ct_src_ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm')) ].values
classe= UNSW.iloc[:, -1].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(
previsores, classe, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
#((90, 4), (90,))
print(X_test.shape, y_test.shape)
#((60, 4), (60,))
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)
########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
kf = KFold(n_splits=30, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    print (confusion_matrix(y_test, logmodel.predict(X_test)))
print(10* '#')
For accuracy, I would use the function cross_val_score, which does exactly what you are looking for. With cv=30 it returns an array of 30 validation accuracies; you can then compute their mean and standard deviation and build a rough confidence interval (mean ± 2*std).
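A minimal sketch of that idea, assuming synthetic data from make_classification in place of the UNSW CSV and shuffle=True on the KFold (the question uses shuffle=False). The names lower and upper are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for previsores / classe (assumption): 600 rows, 22 features.
previsores, classe = make_classification(
    n_samples=600, n_features=22, n_informative=6, random_state=0
)

kf = KFold(n_splits=30, shuffle=True, random_state=0)
logmodel = LogisticRegression(max_iter=1000)

# One accuracy value per fold.
scores = cross_val_score(logmodel, previsores, classe, cv=kf, scoring="accuracy")

mean, std = scores.mean(), scores.std()
lower, upper = mean - 2 * std, mean + 2 * std
print("accuracy: %.3f, rough interval: [%.3f, %.3f]" % (mean, lower, upper))
```

The band mean ± 2*std describes the spread across folds; it is a rough guide rather than a formal confidence interval, because the folds share training data and are not independent.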
Since a confusion matrix is a matrix rather than a single-number performance metric, I would recommend creating a list and appending each fold's validation confusion matrix to it (currently you just print it). At the end, you can use this list to extract a lot of interesting information.
UPDATE:
...
...
cm_holder = []
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    cm_holder.append(confusion_matrix(y_test, logmodel.predict(X_test)))
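Note that len(cm_holder) == 30 and each element is an array of shape (n_classes, n_classes), provided every class appears in that fold's true or predicted labels. Passing labels= to confusion_matrix fixes the shape regardless.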
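One use of such a list is to sum the per-fold matrices into a single overall confusion matrix. A minimal sketch, assuming synthetic three-class data and shuffle=True (both assumptions); total_cm and overall_accuracy are names introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 600 rows, 10 features, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
labels = np.unique(y)

kf = KFold(n_splits=30, shuffle=True, random_state=0)
logmodel = LogisticRegression(max_iter=1000)

cm_holder = []
for train_index, test_index in kf.split(X):
    logmodel.fit(X[train_index], y[train_index])
    y_pred = logmodel.predict(X[test_index])
    # labels= keeps every matrix (n_classes, n_classes),
    # even when a fold happens to miss a class.
    cm_holder.append(confusion_matrix(y[test_index], y_pred, labels=labels))

# Element-wise sum over folds: each sample is counted exactly once.
total_cm = np.sum(cm_holder, axis=0)
overall_accuracy = np.trace(total_cm) / total_cm.sum()
print(total_cm)
print("overall accuracy: %.3f" % overall_accuracy)
```

Summing works because every sample lands in exactly one test fold, so the total matrix covers the whole dataset once.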

cross validation and text classification for imbalanced data

I am new to NLP and I am trying to build a text classifier, but my data is currently imbalanced. The largest category has as many as 280 entries, while the smallest has as few as 30.
I am trying to use cross-validation on this data, but after days of searching I am still unable to implement it, even though it looks pretty straightforward. Here is my code:
y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
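text_clf_svm = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    # n_iter is not accepted by newer scikit-learn releases; max_iter replaces it,
    # and tol=None keeps the fixed number of passes that n_iter=5 implied.
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=5, tol=None, random_state=42)),
])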
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ',np.mean(predicted_svm == y_test))
I have also tried grid search and a stemmer, but right now I want to add cross-validation to this code. I have cleaned the data pretty well, but I am still getting an accuracy of 60%.
Any help would be appreciated
Try oversampling or undersampling. Because the data is highly imbalanced, the model is biased towards the classes with more data points. After over- or undersampling, that bias should shrink and accuracy on the smaller classes may improve.
Alternatively, instead of an SVM you can try an MLP, which can give good results even with imbalanced data.
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target
from sklearn.model_selection import RepeatedKFold
kf = RepeatedKFold(n_splits=20, n_repeats=10, random_state=None)
for train_index, test_index in kf.split(X):
    #print("Train:", train_index, "Validation:",test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
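A shorter route to the same goal is to hand the whole pipeline to cross_val_score with a StratifiedKFold, so each fold keeps the class proportions and the vectorizer is refit on the training part only. A minimal sketch under stated assumptions: the corpus below is made up (40/20/10 documents per class), class_weight='balanced' is swapped in as one alternative to resampling, and macro F1 is used because plain accuracy can hide poor results on small classes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Made-up imbalanced corpus (assumption): 40 / 20 / 10 documents per class.
rng = np.random.default_rng(0)
vocab = {
    "engine": ["engine", "fuel", "piston", "valve"],
    "brakes": ["brake", "pedal", "disc", "pad"],
    "audio": ["radio", "speaker", "volume", "bass"],
}
counts = {"engine": 40, "brakes": 20, "audio": 10}
X = np.array([" ".join(rng.choice(vocab[c], size=5))
              for c in counts for _ in range(counts[c])])
y = np.array([c for c in counts for _ in range(counts[c])])

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # class_weight="balanced" reweights classes inversely to their frequency.
    ("clf", SGDClassifier(loss="hinge", class_weight="balanced", random_state=42)),
])

# Stratified folds preserve the class ratio in every train/validation split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(text_clf, X, y, cv=skf, scoring="f1_macro")
print("macro F1 per fold:", np.round(scores, 3))
print("mean macro F1: %.3f" % scores.mean())
```

With the real data, X would be the 'new description' column and y the Subsystem column; passing them as pandas Series to cross_val_score works without manual indexing.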
