scikit-learn cross_validation over-fitting or under-fitting - scikit-learn

I'm using scikit-learn cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html) and get, for example, a mean score of 0.82 (r2_scorer).
How can I tell whether I am over-fitting or under-fitting using scikit-learn functions?

Unfortunately I can confirm that there is no built-in tool to compare train and test scores in a CV setup: cross_val_score only reports test scores.
You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in 0.20
from sklearn.metrics import get_scorer

scorer = get_scorer('r2')
cv = KFold(n_splits=5)
train_scores, test_scores = [], []
# `regressor`, `X` and `y` are assumed to be defined already
for train, test in cv.split(X):
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)
If you compute the mean train and test scores with cross validation you can then find out if you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2).
Overfitting: the train and test scores are not close to one another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
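As a side note, newer scikit-learn versions also provide cross_validate with return_train_score=True, which reports both train and test scores per fold without the manual loop above. A minimal sketch, assuming regressor, X and y are defined as in the snippet above:
import numpy as np
from sklearn.model_selection import cross_validate

# `regressor`, `X` and `y` are assumed to be defined already
scores = cross_validate(regressor, X, y, cv=5, scoring='r2',
                        return_train_score=True)
print("mean train r2:", np.mean(scores['train_score']))
print("mean test r2:", np.mean(scores['test_score']))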

You should compare your scores when testing on training and testing data. If the scores are close to equal (and both are low), you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both train and test data, you can use something like the following (assuming your data is in variables X and Y):
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn import svm

# do five iterations with different random splits
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
    # your predictor, a linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print("Test score", clf.score(X_test, y_test))
    print("Train score", clf.score(X_train, y_train))

Related

k-NN GridSearchCV taking extremely long time to execute

I am attempting to use sklearn to train a KNN model on the MNIST classification task. When I try to tune my parameters using either sklearn's GridSearchCV or RandomizedSearchCV classes, my code takes an extremely long time to execute.
As an experiment, I created a KNN model using KNeighborsClassifier() with the default parameters and passed these same parameters to GridSearchCV. As far as I know, this should mean GridSearchCV only has a single set of parameters and so should effectively not perform a "search". I then called the .fit() methods of both on the training data and timed their execution (see code below). The KNN model's .fit() method took about 11 seconds to run, whereas the GridSearchCV model took over 20 minutes.
I understand that GridSearchCV should take slightly longer as it is performing 5-fold cross validation, but the difference in execution time seems too large for it to be explained by that.
Am I doing something with my GridSearchCV call that is causing it to take such a long time to execute? And is there anything that I can do to accelerate it?
import sklearn
import time
# importing models
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# Importing data
from sklearn.datasets import fetch_openml

mnist = fetch_openml(name='mnist_784')
print("data loaded")

# splitting the data into stratified train & test sets
X, y = mnist.data, mnist.target  # mnist.data.shape is (n_samples, n_features)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(X, y):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]
print("data split")

# Data has no missing values and is preprocessed, so no cleaning needed.
# using a KNN model, as recommended
knn = KNeighborsClassifier()
print("model created")

print("training model")
start = time.time()
knn.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn fit was: {end-start}")

# Parameter tuning.
# starting by performing a broad-range search on n_neighbours to work out the
# rough scale the parameter should be on
print("beginning param tuning")
params = {'n_neighbors': [5],
          'weights': ['uniform'],
          'leaf_size': [30]
          }
paramSearch = GridSearchCV(
    estimator=knn,
    param_grid=params,
    cv=5,
    n_jobs=-1)
start = time.time()
paramSearch.fit(X_train, y_train)
end = time.time()
print(f"Execution time for knn paramSearch was: {end-start}")
With vanilla KNN, the costly procedure is predicting, not fitting: fitting just saves a copy of the data, and then predicting has to do the work of finding nearest neighbors. So since your search involves scoring on each test fold, that's going to take a lot more time than just fitting. A better comparison would have you predict on the training set in the no-search section.
However, sklearn does have different options for the algorithm parameter, which aim to trade away some of the prediction complexity for added training time, by building a search structure so that fewer comparisons are needed at prediction time. With the default algorithm='auto', you're probably building a ball tree, so the effect described in the first paragraph won't be as pronounced. I suspect this is still the issue though: the training time is now non-negligible, but the scoring portion of the search is what takes most of the time.
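To make the comparison fairer, you could time a scoring step (which calls predict) in the no-search section as well. A minimal sketch, reusing knn, X_train and y_train from the question (assumed to be NumPy arrays), with a subset to keep the runtime manageable:
import time

# Time scoring (which calls predict) rather than just fit, so the comparison
# with GridSearchCV's per-fold scoring is apples to apples.
start = time.time()
train_accuracy = knn.score(X_train[:10000], y_train[:10000])  # subset to keep runtime manageable
end = time.time()
print(f"Execution time for knn scoring was: {end - start}")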

Cross-validation when using the Gaussian Naive Bayes model

Well, I am trying to solve this classification problem that involves the Gaussian Naive Bayes algorithm.
Question:
Classification
Consider the data in the file (link below). Train a Gaussian Naive Bayes algorithm using the holdout cross-validation method (use the first 700 lines as the training set and the rest as the test set). What is the accuracy on the training set? What is the accuracy on the test set? Do the same training with the Leave-One-Out method. What is the average accuracy on the training set? What is the average accuracy on the test set?
My solution that I am not sure about:
Basic Code (Full code in the Collab link below):
# Using holdout
from sklearn.metrics import confusion_matrix, accuracy_score

# `classifier` is a GaussianNB already fitted on (X_train, y_train) in the full Colab code
y_pred_train = classifier.predict(X_train)
y_pred = classifier.predict(X_test)  # test-set predictions (defined in the full Colab code)
cm0 = confusion_matrix(y_train, y_pred_train)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))         # test accuracy
print(accuracy_score(y_train, y_pred_train))  # train accuracy
My Answer for the holdout:
[[ 23  51]
 [ 21 205]]
0.76                (test accuracy)
0.7871428571428571  (train accuracy)
LOO:
# Using LOO
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
# This is where I got the code: https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/
cv = LeaveOneOut()
# note: the labels passed here are the holdout predictions (y_pred_train), not the ground-truth y_train
accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_pred_train, scoring='accuracy', cv=cv)
print(f"Accuracy Train {accuracies.mean()}")
print(f"Standard Deviation {accuracies.std()}")
accuraciestest = cross_val_score(estimator=classifier, X=X_test, y=y_test, scoring='accuracy', cv=cv)
print(f"Accuracy Test {accuraciestest.mean()}")
print(f"Standard Deviation Test {accuraciestest.std()}")
My Answer for the LeaveOneOut:
Accuracy Train 0.9771428571428571
Standard Deviation 0.1494479637785374
Accuracy Test 0.7433333333333333
Standard Deviation Test 0.43679387460092534
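An alternative sketch (not from the original post): cross_validate with return_train_score=True reports both the train and test accuracy of every leave-one-out split in one call, which maps more directly onto the question's "average accuracy for the training set / test set". It assumes classifier, X_train and y_train are defined as above:
from sklearn.model_selection import LeaveOneOut, cross_validate

# Each LOO split trains on n-1 samples and tests on the single held-out sample
res = cross_validate(classifier, X_train, y_train, cv=LeaveOneOut(),
                     scoring='accuracy', return_train_score=True)
print("Mean LOO train accuracy:", res['train_score'].mean())
print("Mean LOO test accuracy:", res['test_score'].mean())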
Data:
https://drive.google.com/file/d/1v9V-007yV3vVckPcQN0Q5VuNZYF_JjBW/view?usp=sharing
Colabs Link: https://colab.research.google.com/drive/1X68-Li6FacnAAQ4ASg3mmqdrU15v2ReP?usp=sharing

Sklearn Logistic Regression predict_proba returning 0 or 1

I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters, including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss='log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but I'm not sure. Also, this issue only occurs if I standardize the training set data. So far I've tried both StandardScaler and MinMaxScaler, with the same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
                   max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
                   random_state=58, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(). Which is odd, because the data is not on a uniform scale across all features: for instance, some features are represented as percents, others as dollar values, and others as dummy-coded representations.
This can happen when you do the following two things in sequence:
1. Fit an estimator with standardized training data, and then later on
2. Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with a unit variance of 1.0 (X_scaled), the variance of the data it's being tested on (X_test) is much higher than expected. It's no surprise then that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
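The same reasoning applies if you also cross-validate: passing the pipeline (rather than the bare estimator) to cross_val_score re-fits the scaler on each training fold, so the held-out fold is never scored with mismatched scaling. A minimal sketch, reusing X, y and pipeline from Example 1 above:
from sklearn.model_selection import cross_val_score

# The scaler inside the pipeline is re-fit on each training fold only
scores = cross_val_score(pipeline, X, y, cv=5, scoring='neg_log_loss')
print(scores.mean())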

Scaling of stock data

I am trying to apply machine learning to stock prediction, and I have run into a problem with scaling when future unseen stock close values are much higher than anything seen in training.
Let's say I use random forest regression to predict the stock price. I split the data into a train set and a test set.
For the train set, I use StandardScaler and do fit and transform,
and then I fit the regressor.
For the test set, I use StandardScaler and do transform only,
and then I use the regressor to predict and compare against the test labels.
If I plot the predictions and the test labels on a graph, the predictions seem to max out or hit a ceiling. The problem is that the StandardScaler was fit on the train set, while the test set (later in the timeline) has much higher values, so the algorithm does not know what to do with these extreme data points.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

def test(X, y):
    # split the data chronologically (no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    # preprocess the data
    pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])
    # model = LinearRegression()
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    # preprocessing: fit + transform on train data
    X_train = pipeline.fit_transform(X_train)
    # fit model on train data with train labels
    model.fit(X_train, y_train)
    # transform on test data
    X_test = pipeline.transform(X_test)
    # predict on test data
    y_pred = model.predict(X_test)
    # print(np.sqrt(mean_squared_error(y_test, y_pred)))
    d = {'actual': y_test, 'predict': y_pred}
    plot_data = pd.DataFrame.from_dict(d)
    sns.lineplot(data=plot_data)
    plt.show()
What should be done with the scaling?
This is what I got when plotting the predicted and the actual close price against time.
The problem comes mainly from the model you are using. RandomForestRegressor is built on decision trees, which learn to map inputs to the outputs seen in the training set. Consequently, a random forest regressor will work for values in the middle of the training range, but for extreme values it has never seen during training it will of course behave as your picture shows: the predictions saturate, because trees cannot predict outside the range of targets they were trained on.
What you want is to learn a function that can extrapolate, using linear/polynomial regression or more advanced time-series algorithms like ARIMA.
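A small synthetic sketch of this effect (not from the original answer): both models are trained on the early, lower part of a trending series; the random forest's predictions plateau at the maximum target seen in training, while the linear model keeps extrapolating. It assumes only numpy and scikit-learn:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# A simple upward-trending series: target grows with time
t = np.arange(200).reshape(-1, 1).astype(float)
y = 0.5 * t.ravel() + np.random.RandomState(0).normal(scale=2.0, size=200)

# Chronological split: train on the first 80%, test on the later (higher) 20%
t_train, t_test = t[:160], t[160:]
y_train, y_test = y[:160], y[160:]

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(t_train, y_train)
lr = LinearRegression().fit(t_train, y_train)

# The forest cannot predict above the largest target it saw in training,
# so its predictions on the later period flatten out; the linear model keeps rising.
print("max target in training:", y_train.max())
print("random forest predictions:", rf.predict(t_test)[:5])
print("linear regression predictions:", lr.predict(t_test)[:5])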

Is passing sklearn tfidf matrix to train MultinomialNB model proper?

I'm doing some text classification tasks. What I have observed is that, when fed a tf-idf matrix (from sklearn's TfidfVectorizer), a LogisticRegression model always outperforms a MultinomialNB model. Below is my code for training both:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# df_new is a DataFrame holding the raw documents and their labels
X = df_new['text_content']
y = df_new['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)

vectorizer = TfidfVectorizer(stop_words='english')
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)

clf_lr = LogisticRegression()
clf_lr.fit(X_train_dtm, y_train)
y_pred = clf_lr.predict(X_test_dtm)
lr_score = accuracy_score(y_test, y_pred)  # perfectly balanced binary classes

clf_mnb = MultinomialNB()
clf_mnb.fit(X_train_dtm, y_train)
y_pred = clf_mnb.predict(X_test_dtm)
mnb_score = accuracy_score(y_test, y_pred)  # perfectly balanced binary classes
Currently lr_score > mnb_score always. I'm wondering how exactly MultinomialNB uses the tf-idf matrix, since the term frequencies in tf-idf are computed without any class information. Is there any chance that I should not feed a tf-idf matrix to MultinomialNB the same way I do to LogisticRegression?
Update: I understand the difference between the outputs of TfidfVectorizer and CountVectorizer. I also just checked the source code of sklearn's MultinomialNB.fit() function, and it looks like it expects counts as opposed to frequencies. This would also explain the performance boost mentioned in my comment below. However, I'm still wondering whether passing tf-idf into MultinomialNB makes sense under any circumstances. The sklearn documentation briefly mentions the possibility, but without much detail.
Any advice would be much appreciated!
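As an illustrative aside (not from the original question), one quick way to probe the counts-versus-tf-idf question is to train the same MultinomialNB on both representations and compare. A minimal, self-contained sketch on two 20-newsgroups categories, standing in for df_new:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Two roughly balanced binary classes as a stand-in for the original dataset
data = fetch_20newsgroups(subset='train', categories=['sci.space', 'rec.autos'],
                          remove=('headers', 'footers', 'quotes'))
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

for name, vec in [('counts', CountVectorizer(stop_words='english')),
                  ('tf-idf', TfidfVectorizer(stop_words='english'))]:
    Xtr = vec.fit_transform(X_train)
    Xte = vec.transform(X_test)
    clf = MultinomialNB().fit(Xtr, y_train)
    print(name, accuracy_score(y_test, clf.predict(Xte)))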
