LogisticRegression classifier - python-3.x

I need to use a Logistic Regression classifier. I have a dataset in which each column has 2000 values. This is all my code:
from statistics import mode
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
# Importing the datasets
###Social_Network_Ads
datasets = pd.read_csv('C:/Users/n3.csv',header=None)
X = datasets.iloc[:, 0:5].values
Y = datasets.iloc[:, 5].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
# instantiate the model (using the default parameters)
model = LogisticRegression()
# fit the model with data
model.fit(X_Train, Y_Train)
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
train_acc = model.score(X_Train, Y_Train)
print("The Accuracy for Training Set is {}".format(train_acc*100))
But I got this error:
TypeError: Cannot clone object '<function mode at 0x000000FD6579B9D0>'
(type <class 'function'>): it does not seem to be a scikit-learn
estimator as it does not implement a 'get_params' method.
How can I solve this?

Change this line
predicted = cross_val_predict(mode, X_Train, Y_Train, cv=5)
to
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)
You have a simple typo. You want to pass your estimator (model) to the function, but instead you passed mode, which is imported from statistics. That's why the error says it cannot clone an object of type function: cross_val_predict received a function where it expects an estimator.
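For reference, here is a minimal sketch of the corrected call. The synthetic data from make_classification is only an assumed stand-in for the CSV in the question (2000 rows, 5 feature columns), and cv_acc is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Assumption: synthetic stand-in for the CSV in the question (2000 rows, 5 features).
X, Y = make_classification(n_samples=2000, n_features=5, random_state=0)
X_Train, X_Test, Y_Train, Y_Test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

model = LogisticRegression()

# Pass the estimator object itself, not statistics.mode.
# cross_val_predict clones the estimator and refits it on each fold.
predicted = cross_val_predict(model, X_Train, Y_Train, cv=5)

# Fraction of training rows whose out-of-fold prediction matches the label.
cv_acc = (predicted == Y_Train).mean()
print("Cross-validated accuracy: {:.2f}".format(cv_acc * 100))
```

Since cross_val_predict fits its own clones, the separate model.fit call is only needed for the later model.score line.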

Related

Calculating precision, recall, and F-measure for Logistic Regression classifier

I have a labeled and clean dataset for sentiment analysis, and I used logistic regression for classification. Here is my code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
xl = pd.ExcelFile('d:/data.xlsx')
df3 = xl.parse("Sheet1")
cl_data, sent = df3['Clean-Reviews'].fillna(' '), df3['Sentiment']
sent_train, sent_test, y_train, y_test = train_test_split(cl_data, sent,
test_size=0.25, random_state=1000)
vectorizer = CountVectorizer()
vectorizer.fit(sent_train)
X_train = vectorizer.transform(sent_train)
X_test = vectorizer.transform(sent_test)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
When I try to calculate precision, recall, and F-measure:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
print(f1_score(X_test, y_test, average="macro"))
print(precision_score(X_test, y_test, average="macro"))
print(recall_score(X_test, y_test, average="macro"))
I got an error:
TypeError: len() of unsized object
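Can anyone tell me what the problem is here? Thanks in advance.
These metrics compare the true labels with the predicted labels, and in your code X_test is the feature matrix, not a prediction. Predict first, then pass y_test and the predictions to each metric:
y_pred = classifier.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))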
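A self-contained sketch of the same fix, in case it helps. The tiny corpus and its labels are made up for illustration (they are not the data from the question); only the order of arguments to the metric functions matters here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy corpus standing in for the 'Clean-Reviews' column.
train_docs = ["good movie", "great film loved it", "bad movie",
              "terrible film hated it", "loved it great", "hated it bad"]
y_train = [1, 1, 0, 0, 1, 0]
test_docs = ["great movie loved it", "terrible bad film"]
y_test = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# The metrics take (true labels, predicted labels), never the feature matrix.
y_pred = classifier.predict(X_test)
f1 = f1_score(y_test, y_pred, average="macro")
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(f1, precision, recall)
```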

FastText: Can't get cross_validation

I am struggling to use FastText (FTTransformer) in a Pipeline that iterates over different vectorizers. More specifically, I can't get cross-validation scores. I use the following code:
%%time
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
w2v_texts = [simple_preprocess(doc) for doc in X_train]
models = [FTTransformer(size=10, min_count=0, seed=42)]
classifiers = [LogisticRegression(random_state=0)]
for model in models:
    for classifier in classifiers:
        model.fit(w2v_texts)
        classifier.fit(model.transform(X_train), y_train)
        pipeline = Pipeline([
            ('vec', model),
            ('clf', classifier)
        ])
        print(pipeline.score(X_train, y_train))
        #print(model.gensim_model.wv.most_similar('kirk'))
        cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
KeyError: 'all ngrams for word "Machine learning can be useful
branding sometimes" absent from model'
How can the problem be solved?
Sidenote: My other pipelines with D2VTransformer or TfIdfVectorizer work just fine. With those, I can simply call pipeline.fit(X_train, y_train) after defining the pipeline, instead of the two fits shown above. It seems FTTransformer doesn't integrate as well as the other vectorizers?
Yes. To be used in a pipeline, FTTransformer needs to be modified to split documents into words inside its fit method. cross_val_score clones the pipeline and refits it on the raw strings in X_train, so the tokenization you did outside the pipeline is not applied. One can do it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
from gensim.sklearn_api.ftmodel import FTTransformer
np.random.seed(0)
class FTTransformer2(FTTransformer):
    def fit(self, x, y):
        super().fit([simple_preprocess(doc) for doc in x])
        return self
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(data.text, data.label, random_state=0)
classifiers = [LogisticRegression(random_state=0)]
for classifier in classifiers:
    pipeline = Pipeline([
        ('ftt', FTTransformer2(size=10, min_count=0, seed=0)),
        ('clf', classifier)
    ])
    score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
    print(score)
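The general pattern (tokenize inside fit so the pipeline can be handed raw documents) can also be sketched without gensim. TokenizingMeanEmbedder below is a hypothetical toy class written for this illustration, not a gensim or scikit-learn API: it assigns each token a random vector and averages them per document. The corpus and labels are likewise made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class TokenizingMeanEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an embedding transformer (not a real library class)."""

    def __init__(self, size=10, seed=0):
        self.size = size
        self.seed = seed

    def _tokenize(self, doc):
        # Tokenization lives inside the transformer, so raw strings are fine.
        return doc.lower().split()

    def fit(self, X, y=None):
        rng = np.random.RandomState(self.seed)
        vocab = sorted({tok for doc in X for tok in self._tokenize(doc)})
        # One random vector per token; a real model would learn these.
        self.vectors_ = {tok: rng.normal(size=self.size) for tok in vocab}
        return self

    def transform(self, X):
        rows = []
        for doc in X:
            vecs = [self.vectors_[t] for t in self._tokenize(doc)
                    if t in self.vectors_]
            # Documents with no known token map to the zero vector.
            rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(self.size))
        return np.vstack(rows)


# Made-up corpus: 20 short documents, two balanced classes.
docs = np.array(["machine learning can be useful",
                 "branding matters sometimes",
                 "deep learning models are useful",
                 "marketing and branding campaigns"] * 5)
labels = np.array([1, 0, 1, 0] * 5)

pipeline = Pipeline([
    ('vec', TokenizingMeanEmbedder(size=10, seed=0)),
    ('clf', LogisticRegression(random_state=0)),
])
# The pipeline is cloned and refit per fold directly on the raw strings.
scores = cross_val_score(pipeline, docs, labels, scoring='accuracy', cv=5)
print(scores)
```

Because the transformer tokenizes in both fit and transform, no preprocessing has to happen outside the pipeline, which is what cross_val_score needs.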

Error encountered: Classification metrics can't handle a mix of multiclass-multioutput and binary targets

from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
file = './BBC.csv'
df = read_csv(file)
array = df.values
X = array[:, 0:11]
Y = array[:, 11]
test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = RandomForestClassifier()
model.fit(X_train, Y_train)
result = model.score(X_test, X_test)
print("Accuracy: %.3f%%" % (result*100.0))
dataset: https://www.dropbox.com/s/ar1c9yuv5x774cv/BBC.csv?dl=0
I have encountered this error:
Classification metrics can't handle a mix of multiclass-multioutput and binary targets
If I'm not wrong, RandomForest should be able to handle both classes (classification) and means (regression). Am I wrong?
Edit:
I checked your dataset. For the classification task, the problem lies in your code:
result = model.score(X_test, X_test)
The arguments here should be X_test and Y_test:
result = model.score(X_test, Y_test)
Somewhat off-topic: if you want to use a random forest for regression, you should use RandomForestRegressor instead.
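A runnable sketch of the corrected scoring call. The data comes from make_classification as an assumed substitute for BBC.csv (11 feature columns, one label column); the rest mirrors the code in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data with 11 features in place of BBC.csv.
X, Y = make_classification(n_samples=300, n_features=11, random_state=45)

test_size = 0.30
seed = 45
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=test_size, random_state=seed)

model = RandomForestClassifier(random_state=seed)
model.fit(X_train, Y_train)

# score(features, true labels): the second argument must be Y_test.
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
```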

How to compute accuracy and the confusion matrix using K-fold cross-validation?

I tried to do K-fold cross-validation with K=30 folds, with one confusion matrix for each fold. How can I compute the accuracy and the confusion matrix for the model, with a confidence interval?
Could someone help me?
My code is:
import numpy as np
from sklearn import model_selection
from sklearn import datasets
from sklearn import svm
import pandas as pd
from sklearn.linear_model import LogisticRegression
UNSW = pd.read_csv('/home/sec/Desktop/CEFET/tudao.csv')
previsores = UNSW.iloc[:,UNSW.columns.isin(('sload','dload',
'spkts','dpkts','swin','dwin','smean','dmean',
'sjit','djit','sinpkt','dinpkt','tcprtt','synack','ackdat','ct_srv_src','ct_srv_dst','ct_dst_ltm',
'ct_src_ltm','ct_src_dport_ltm','ct_dst_sport_ltm','ct_dst_src_ltm')) ].values
classe= UNSW.iloc[:, -1].values
X_train, X_test, y_train, y_test = model_selection.train_test_split(
previsores, classe, test_size=0.4, random_state=0)
print(X_train.shape, y_train.shape)
#((90, 4), (90,))
print(X_test.shape, y_test.shape)
#((60, 4), (60,))
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
print(previsores.shape)
########K FOLD
print('########K FOLD########K FOLD########K FOLD########K FOLD')
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
kf = KFold(n_splits=30, random_state=None, shuffle=False)
kf.get_n_splits(previsores)
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    print (confusion_matrix(y_test, logmodel.predict(X_test)))
print(10* '#')
For accuracy, I would use the function cross_val_score, which does exactly what you are looking for. It returns an array of 30 validation accuracies; you can then compute their mean, standard deviation, etc. and build a rough confidence interval (mean ± 2*std).
Since a confusion matrix cannot be treated as a single performance metric (it is a matrix, not a number), I would recommend creating a list and appending each fold's validation confusion matrix to it (currently you just print it). At the end, you can use this list to extract a lot of interesting information.
UPDATE:
...
...
cm_holder = []
for train_index, test_index in kf.split(previsores):
    X_train, X_test = previsores[train_index], previsores[test_index]
    y_train, y_test = classe[train_index], classe[test_index]
    logmodel.fit(X_train, y_train)
    cm_holder.append(confusion_matrix(y_test, logmodel.predict(X_test)))
Note that len(cm_holder) == 30 and each element is an array of shape (n_classes, n_classes).
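Putting both suggestions together, here is a sketch under some assumptions: the data is synthetic (make_classification instead of the CSV from the question), shuffle=True is my choice so the folds are not ordered, and labels=[0, 1] is passed so every fold yields a 2x2 matrix even if one class is missing from it. The mean ± 2*std range is only a rough interval, not a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_score

# Assumption: synthetic binary data in place of the CSV from the question.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
logmodel = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=30, shuffle=True, random_state=0)

# One accuracy per fold, then a rough interval around the mean.
scores = cross_val_score(logmodel, X, y, scoring="accuracy", cv=kf)
mean, std = scores.mean(), scores.std()
print("accuracy: %.3f, rough interval: [%.3f, %.3f]"
      % (mean, mean - 2 * std, mean + 2 * std))

# One confusion matrix per fold, collected instead of printed.
cm_holder = []
for train_index, test_index in kf.split(X):
    logmodel.fit(X[train_index], y[train_index])
    y_pred = logmodel.predict(X[test_index])
    cm_holder.append(confusion_matrix(y[test_index], y_pred, labels=[0, 1]))

# Summing the per-fold matrices gives one matrix over all held-out samples.
total_cm = np.sum(cm_holder, axis=0)
print(total_cm)
```

Summing the matrices is one option among several; averaging them or inspecting the per-fold spread works from the same list.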

Sklearn elastic-net logistic regression (SGDClassifier) does not return a probability

In sklearn, when using SGDClassifier for elastic-net logistic regression, the predict_proba function returns the same thing as the predict function.
That is, the code below (with X and y the predictors and binary label respectively) returns True:
EN = sklearn.linear_model.SGDClassifier(loss='log', penalty='elasticnet',
alpha=0.0001, l1_ratio=0.15)
EN.fit(X[train], y[train])
numpy.all(EN.predict(X[test]) == EN.predict_proba(X[test])[:,1])
How to obtain probability values?
It seems that the sklearn version is the problem. You need to upgrade to 0.18.2.
Example using iris data:
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
import numpy
import sklearn
data = load_iris()
x = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
EN = SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, l1_ratio=0.15)
EN.fit(X_train, y_train)
numpy.all(EN.predict(X_test) == EN.predict_proba(X_test)[:,1])
sklearn.__version__
Result
False
'0.18.2'
So with sklearn 0.18.2 it works fine.
