Is passing sklearn tfidf matrix to train MultinomialNB model proper? - scikit-learn

I'm do some text classification tasks. What I have observed is that if fed tfidf matrix(from sklearn's TfidfVectorizer), Logistic Regression model is always outperforming MultinomialNB model. Below is my code for training both:
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)
clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred) # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB is using the tfidf matrix since the term frequency in tfidf is calculated based on no class information. Any chance that I should not feed tfidf matrix to MultinomialNB the same way I did to LogisticRegression?
Update: I understand the difference between results of TfidfVectorizer and CountVectorizer. And I also just checked the sources code of sklearn's MultinomialNB.fit() function, looks like it does expect a count as oppose to frequency. This will also explain the performance boost mentioned in my comment below. However, I'm still wondering if under any circumstances pass tfidf into MultinomialNB makes sense. The sklearn documentation briefly mentioned the possibility, but not much details.
Any advice would be much appreciated!

Related

ML Model not predicting properly

I am trying to create an ML model (regression) using various techniques like SMR, Logistic Regression, and others. With all the techniques, I'm not able to get efficiency more than 35%. Here's what I'm doing:
X_data = [X_data_distance]
X_data = np.vstack(X_data).astype(np.float64)
X_data = X_data.T
y_data = X_data_orders
#print(X_data.shape)
#print(y_data.shape)
#(10000, 1)
#(10000,)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.33, random_state=42)
svr_rbf = SVC(kernel= 'rbf', C= 1.0)
svr_rbf.fit(X_train, y_train)
plt.plot(X_data_distance, svr_rbf.predict(X_data), color= 'red', label= 'RBF model')
For the plot, I'm getting the following:
I have tried various parameter tuning, changing the parameter C, gamma even tried different kernels, but nothing changes the accuracy. Even tried SVR, Logistic regression instead of SVC, but nothing helps. I tried different scaling for training input data like StandardScalar() and scale().
I used this as a reference
What should I do?
As a rule of thumb, we usually follow this convention:
For little number of features, go with Logistic Regression.
For a lot of features but not a lot of data, go with SVM.
For a lot of features and a lot of data, go with Neural Network.
Because your dataset is a 10K cases, it'd be better to use Logistic Regression because SVM will take forever to finish!.
Nevertheless, because your dataset contains a lot of classes, there is a chance of classes imbalance in your implementation. Thus I tried to workaround this problem via using the StratifiedKFold instead of train_test_split which doesn't guarantee balanced classes in the splits.
Moreover, I used GridSearchCV with StratifiedKFold to perform Cross-Validation in order to tune the parameters and try all different optimizers!
So the full implementation is as follows:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
import numpy as np
def getDataset(path, x_attr, y_attr):
"""
Extract dataset from CSV file
:param path: location of csv file
:param x_attr: list of Features Names
:param y_attr: Y header name in CSV file
:return: tuple, (X, Y)
"""
df = pd.read_csv(path)
X = X = np.array(df[x_attr]).reshape(len(df), len(x_attr))
Y = np.array(df[y_attr])
return X, Y
def stratifiedSplit(X, Y):
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X, Y))
X_train, X_test = X[train_index], X[test_index]
Y_train, Y_test = Y[train_index], Y[test_index]
return X_train, X_test, Y_train, Y_test
def run(X_data, Y_data):
X_train, X_test, Y_train, Y_test = stratifiedSplit(X_data, Y_data)
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000], 'penalty': ['l1', 'l2'],
'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
model = LogisticRegression(random_state=0)
clf = GridSearchCV(model, param_grid, cv=StratifiedKFold(n_splits=10))
clf.fit(X_train, Y_train)
print(accuracy_score(Y_train, clf.best_estimator_.predict(X_train)))
print(accuracy_score(Y_test, clf.best_estimator_.predict(X_test)))
X_data, Y_data = getDataset("data - Sheet1.csv", ['distance'], 'orders')
run(X_data, Y_data)
Despite all the attempts with all different algorithms, the accuracy didn't exceed 36%!!.
Why is that?
If you want to make a person recognize/classify another person by their T-shirt color, you cannot say: hey if it's red that means he's John and if it's red it's Peter but if it's red it's Aisling!! He would say "really, what the hack is the difference"?!!.
And that's exactly what is in your dataset!
Simply, run print(len(np.unique(X_data))) and print(len(np.unique(Y_data))) and you'll find that the numbers are so weird, in a nutshell you have:
Number of Cases: 10000 !!
Number of Classes: 118 !!
Number of Unique Inputs (i.e. Features): 66 !!
All classes are sharing hell a lot of information which make it impressive to have even up to 36% accuracy!
In other words, you have no informative features which lead to a lack in the uniqueness of each class model!
What to do?
I believe you are not allowed to remove some classes, so the only two solutions you have are:
Either live with this very valid result.
Or add more informative feature(s).
Update
Having you provided same dataset but with more features (i.e. complete set of features), the situation now is different.
I recommend you do the following:
Pre-process your dataset (i.e. prepare it by imputing missing values or deleting rows containing missing values, and converting dates to some unique values (example) ...etc).
Check what features are most important to the Orders Classes, you can achieve that by using of Forests of Trees to evaluate the importance of features. Here is a complete and simple example of how to do that in Scikit-Learn.
Create a new version of the dataset but this time hold Orders as the Y response, and the above-found features as the X variables.
Follow the same GrdiSearchCV and StratifiedKFold procedure that I showed you in the implementation above.
Hint
As per mentioned by Vivek Kumar in the comment below, stratify parameter has been added in Scikit-learn update to the train_test_split function.
It works by passing the array-like ground truth, so you don't need my workaround in the function stratifiedSplit(X, Y) above.

Keras and Sklearn logreg returning different results

I'm comparing the results of a logistic regressor written in Keras to the default Sklearn Logreg. My input is one-dimensional. My output has two classes and I'm interested in the probability that the output belongs to the class 1.
I'm expecting the results to be almost identical, but they are not even close.
Here is how I generate my random data. Note that X_train, X_test are still vectors, I'm just using capital letters because I'm used to it. Also there is no need for scaling in this case.
X = np.linspace(0, 1, 10000)
y = np.random.sample(X.shape)
y = np.where(y<X, 1, 0)
Here's cumsum of y plotted over X. Doing a regression here is not rocket science.
I do a standard train-test-split:
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
Next, I train a default logistic regressor:
from sklearn.linear_model import LogisticRegression
sk_lr = LogisticRegression()
sk_lr.fit(X_train, y_train)
sklearn_logreg_result = sk_lr.predict_proba(X_test)[:,1]
And a logistic regressor that I write in Keras:
from keras.models import Sequential
from keras.layers import Dense
keras_lr = Sequential()
keras_lr.add(Dense(1, activation='sigmoid', input_dim=1))
keras_lr.compile(loss='mse', optimizer='sgd', metrics=['accuracy'])
_ = keras_lr.fit(X_train, y_train, verbose=0)
keras_lr_result = keras_lr.predict(X_test)[:,0]
And a hand-made solution:
pearson_corr = np.corrcoef(X_train.reshape(X_train.shape[0],), y_train)[0,1]
b = pearson_corr * np.std(y_train) / np.std(X_train)
a = np.mean(y_train) - b * np.mean(X_train)
handmade_result = (a + b * X_test)[:,0]
I expect all three to deliver similar results, but here is what happens. This is a reliability diagram using 100 bins.
I have played around with loss functions and other parameters, but the Keras logreg stays roughly like this. What might be causing the problem here?
edit: Using binary crossentropy is not the solution here, as shown by this plot (note that the input data has changed between the two plots).
While both implementations are a form of Logistic Regression there's quite a few differences. While both solutions converge to a comparable minimum (0.75/0.76 ACC) they are not identical.
Optimizer - keras uses vanille SGD where sklearn's LR is based on
liblinear which implements trust region Newton method
Regularization - sklearn has built in L2 regularization
Weights -The weights are randomly initialized and probably sampled from a different distribution.

Binary classification (logistic regression) predict wrong label with high accuracy

I have a problem that a binary Logistic regression (using scikit-learn python=2.7) classification that is predicting the wrong/opposite class with a high accuracy. That is, after fitting the model the predicted score and predicted probabilities for each class are very consistent but always of the wrong class. I cannot share the data, but some pseudo-code of my approach is:
X = np.vstack((cond_1, cond_2)) # shape of X = 200*51102
y = np.concatenate([np.zeros(len(cond_1)), np.ones(len(cond_2)])
scls = []
clfs = []
scores = []
for train, test in cv.split(X, y):
clf = LogisticRegression(C=1)
scl = StandardScaler()
scl.fit(X[train])
X_train = scl.transform(X[train])
scls.append(scl)
X_test = scl.transform(X[test])
clf.fit(X_train, y[train])
y_pred = clf.predict(X_test)
scores.append(roc_auc_score(y[test], y_pred))
The roc_auc scores have a mean of 0.065% and a standard deviation of 0.05% so there seems to be going something, but what? I have plotted the features and they seem to be okay normally distributed. I also look that at the probabilities from predict_proba and they are mostly above 80% for the wrong class/label.
Any ideas what is going on and/or how to proper diagnose the problem?
I apologise for not being able to ask a more precise question but I'm lacking the vocabulary.

sklearn randomforest accuracy

for the same training and test datset, the accuracy of KNN is 0.53, for RandomForest and AdaBoost ,the accuracy is 1, can anyone help?
codes:
## prepare data
begin_date='20140101'
end_date='20160908'
stock_code='000001' #平安银行
data=ts.get_hist_data(stock_code,start=begin_date,end=end_date)
close=data.loc[:,'close']
df=data[:-1]
diff=np.array(close[1:])-np.array(close[:-1])
label=1*(diff>=0)
df.loc[:,'diff']=diff
df.loc[:,'label']=label
#split dataset into trainging and test
df_train=df[df.index<'2016-07-08']
df_test=df[df.index>='2016-07-08']
x_train=df_train[df_train.columns[:-1]]
y_train=df_train['label']
x_test=df_test[df_test.columns[:-1]]
y_test=df_test['label']
##KNN
clf2 = neighbors.KNeighborsClassifier()
clf2.fit(x_train, y_train)
accuracy2 = clf2.score(x_test, y_test)
pred_knn=np.array(clf2.predict(x_test))
#RandomForest
clf3 = RandomForestClassifier(n_estimators=100,n_jobs=-1)
clf3.fit(x_train, y_train)
accuracy3 = clf3.score(x_test, y_test)
pred_rf=np.array(clf3.predict(x_test))
print accuracy1,accuracy2,accuracy3
Different models give different accuracy on same dataset in most of the cases. For example, if you are trying to train and test a dataset with LogisticRegression and SVM, then it is highly likely that both the models will give different score. In order to choose best model for your data, you need to first explore the dataset and then choose some algorithm that performs better in the case.
Also, as your RandomForest and AdaBoost are giving an accuracy of 1, it is very likely that your model is 'overfitted'.

scikit-learn cross_validation over-fitting or under-fitting

I'm using scikit-learn cross_validation(http://scikit-learn.org/stable/modules/cross_validation.html) and get for example 0.82 mean score(r2_scorer).
How could I know do I have over-fitting or under-fitting using scikit-learn functions?
Unfortunately I confirm that there is no built-in tool to compare train and test scores in a CV setup. The cross_val_score tool only reports test scores.
You can setup your own loop with the train_test_split function as in Ando's answer but you can also use any other CV scheme.
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.metrics import SCORERS
scorer = SCORERS['r2']
cv = KFold(5)
train_scores, test_scores = [], []
for train, test in cv:
regressor.fit(X[train], y[train])
train_scores.append(scorer(regressor, X[train], y[train]))
test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2)
Overfitting: the train and test scores are not close from on another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
You should compare your scores when testing on training and testing data. If the scores are close to equal, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both train and test data, you can use something along the following (assuming your data is in variables X and Y):
from sklearn import cross_validation
#do five iterations
for i in range(5):
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, Y, test_size=0.4)
#Your predictor, linear SVM in this example
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print "Test score", clf.score(X_test, y_test)
print "Train score", clf.score(X_train, y_train)

Resources