I am trying to fit a model using a gradient boosting machine (GBM). I first selected features by their ROC-AUC score, using a baseline to remove the features I don't need, and then tried to fit the training set with GBM, but I got an error message.
Here is how I implemented GBM:
# lets drop roc-auc values below 0.54 baseline
x_train.drop(labels=removed_roc_values, axis=1, inplace=True)
x_test.drop(labels=removed_roc_values, axis=1, inplace=True)
x_train.shape, x_test.shape
The output of shape after dropping the baseline features: ((4930, 17), (2113, 23))
# using baseline GBM without tuning
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
baseline = GradientBoostingClassifier(learning_rate=0.1,
                                      n_estimators=100, max_depth=3, min_samples_split=2, min_samples_leaf=1,
                                      subsample=1, max_features='sqrt', random_state=10)
baseline.fit(x_train, y_train)
predictors = list(x_train)
feat_imp = pd.Series(baseline.feature_importances_,
                     predictors).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Importance of Features')
plt.ylabel('Feature Importance Score')
print('Accuracy of the GBM on test set: {:.3f}'.format(baseline.score(x_test, y_test)))
pred = baseline.predict(x_test)
print(classification_report(y_test, pred))
I expected to get the classification report; instead, I got the error below:
ValueError: The number of features of the model must match the input. Model n_features is 17 and input n_features is 23
Thanks.
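One possible reading of the shapes above is that x_test started out with columns that x_train never had, since the same drop was applied to both yet 23 columns remain versus 17. The following is only a sketch under that assumption, with small made-up frames standing in for the real x_train and x_test; it lists the mismatched columns and then restricts the test set to the training columns before predicting.

```python
import pandas as pd

# Hypothetical stand-ins for the real x_train / x_test frames.
x_train = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
x_test = pd.DataFrame({"a": [5], "b": [6], "extra": [7]})

# Columns present in one frame but not in the other.
extra_in_test = [c for c in x_test.columns if c not in x_train.columns]
missing_in_test = [c for c in x_train.columns if c not in x_test.columns]
print("only in test:", extra_in_test)
print("only in train:", missing_in_test)

# Keep exactly the training columns, in the training order.
# This raises a KeyError if the test set lacks a training column.
x_test_aligned = x_test[list(x_train.columns)]
print(x_train.shape, x_test_aligned.shape)
```

If the mismatch comes from an earlier preprocessing step (for example, one-hot encoding applied separately to each set), fixing that step is likely the more robust remedy.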
As an R user, I wanted to also get up to speed on scikit.
Creating linear regression models is fine, but I can't seem to find a reasonable way to get a standard summary of the regression output.
Code example:
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# Load the diabetes datasets
dataset = datasets.load_diabetes()
# Fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)
# Make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# Summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print(model.intercept_, model.coef_, mse)
print(model.score(dataset.data, dataset.target))
Issues:
It seems the intercept and coefficients are built into the model, and I just print them (second-to-last line) to see them.
What about all the other standard regression output, such as R^2, adjusted R^2, p-values, etc.? If I read the examples correctly, it seems you have to write a function/equation for each of these and then print it.
So, is there no standard summary output for linear regression models?
Also, my printed array of coefficients has no variable names associated with it; I just get the numeric array. Is there a way to print the coefficients together with the variables they go with?
My printed output:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
152.133484163 [ -10.01219782 -239.81908937 519.83978679 324.39042769 -792.18416163
476.74583782 101.04457032 177.06417623 751.27932109 67.62538639] 2859.69039877
0.517749425413
Notes: Started off with Linear, Ridge and Lasso. I have gone through the examples. Below is for the basic OLS.
There is no R-style regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning, where the evaluation criteria are based on performance on previously unseen data (such as predictive R^2 for regression).
There does exist a summary function for classification called sklearn.metrics.classification_report which calculates several types of (predictive) scores on a classification model.
For a more classic statistical approach, take a look at statsmodels.
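For the part of the question about coefficients without variable names, one option is to pair coef_ with the dataset's feature_names in a pandas Series. This is a small sketch rather than a built-in summary; the name coef_table is made up for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

dataset = load_diabetes()
model = LinearRegression().fit(dataset.data, dataset.target)

# Label each coefficient with the name of the feature it belongs to.
coef_table = pd.Series(model.coef_, index=dataset.feature_names, name="coefficient")
print("intercept:", model.intercept_)
print(coef_table)
```

With your own data held in a DataFrame, the column names can be used as the index in the same way.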
I use:
import numpy as np
import sklearn.metrics as metrics
def regression_results(y_true, y_pred):
    # Regression metrics
    explained_variance = metrics.explained_variance_score(y_true, y_pred)
    mean_absolute_error = metrics.mean_absolute_error(y_true, y_pred)
    mse = metrics.mean_squared_error(y_true, y_pred)
    # mean_squared_log_error raises ValueError if any value is negative
    mean_squared_log_error = metrics.mean_squared_log_error(y_true, y_pred)
    median_absolute_error = metrics.median_absolute_error(y_true, y_pred)
    r2 = metrics.r2_score(y_true, y_pred)
    print('explained_variance: ', round(explained_variance, 4))
    print('mean_squared_log_error: ', round(mean_squared_log_error, 4))
    print('r2: ', round(r2, 4))
    print('MAE: ', round(mean_absolute_error, 4))
    print('MSE: ', round(mse, 4))
    print('RMSE: ', round(np.sqrt(mse), 4))
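The helper above does not cover the adjusted R^2 that the question asks about. As a hedged sketch, it can be derived from the ordinary R^2 with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors; the variable names here are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

dataset = load_diabetes()
X, y = dataset.data, dataset.target
model = LinearRegression().fit(X, y)

# In-sample R^2, then the adjustment for the number of predictors.
r2 = model.score(X, y)
n_samples, n_features = X.shape
adjusted_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
print(round(r2, 4), round(adjusted_r2, 4))
```

Like the R^2 it is derived from, this is computed on the training data, so it describes the fit rather than performance on unseen data.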
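The statsmodels package gives a quite decent summary:
from statsmodels.api import OLS, add_constant
# add_constant adds the intercept column, which OLS does not include by default
OLS(dataset.target, add_constant(dataset.data)).fit().summary()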
You can do this using statsmodels:
import statsmodels.api as sm
X = sm.add_constant(X.ravel())
results = sm.OLS(y, X).fit()
results.summary()
results.summary() will organize the results into three tables.
You can use the following option to have a summary table:
import statsmodels.api as sm
#log_clf = LogisticRegression()
log_clf = sm.Logit(y_train, X_train)
classifier = log_clf.fit()
y_pred = classifier.predict(X_test)
print(classifier.summary2())
Use model.summary() after predict. Note that this only works when model is a fitted statsmodels results object; scikit-learn's LinearRegression has no summary() method, so in the code below the call is commented out:
# Linear Regression
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
# load the diabetes datasets
dataset = datasets.load_diabetes()
# fit a linear regression model to the data
model = LinearRegression()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# >>>>>>>Print out the statistics<<<<<<<<<<<<<
# model.summary()  # AttributeError on a scikit-learn LinearRegression
# summarize the fit of the model
mse = np.mean((predicted-expected)**2)
print(model.intercept_, model.coef_, mse)
print(model.score(dataset.data, dataset.target))
I don't have any example data to share in order to replicate the problem, but perhaps someone can provide a high-level answer. I've created a lot of logistic regression models in the past, and this is the first time my predict_proba scores are showing up as either 1 or 0.
I'm creating a binary classifier to predict one of two labels. I've also used a couple of other algorithms, XGBClassifier and RandomForestClassifier, with the same dataset. For these, predict_proba yields the expected probability results (i.e., float values between 0 and 1).
Also, for the LogisticRegression model, I've tried a variety of parameters including all default params, yet the issue persists. Weirdly enough, using SGDClassifier with loss = 'log' or 'modified_huber' also yields the same binary predict_proba results, so I'm thinking this might be something intrinsic to the dataset, but not sure. Also, this issue only occurs if I standardize training set data. So far I've tried both StandardScaler and MinMaxScaler, same results.
Has anyone ever encountered a problem such as this?
Edit:
The LR parameters are:
LogisticRegression(C=1.7993269963183343, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1, l1_ratio=.5,
max_iter=100, multi_class='warn', n_jobs=-1, penalty='elasticnet',
random_state=58, solver='saga', tol=0.0001, verbose=0,
warm_start=False)
Again, the issue only occurs when standardizing the data with either StandardScaler() or MinMaxScaler(), which is odd because the data is not on a uniform scale across all features. For instance, some features are represented as percents, others are represented as dollar values, and others are dummy-coded representations.
This can happen when you do the following two things in sequence:
Fit an estimator with standardized training data and then later on,
Pass unstandardized data to the same estimator in the validation or testing phase.
Here's an example of predict_proba returning 0 or 1 using the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)
# Example 1 [CORRECT]
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)
# Pipeline(steps=[('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
print(pipeline)
y_pred = pipeline.predict_proba(X_test)
# [0.37264656 0.62735344]
print(y_pred.mean(axis=0))
# Example 2 [INCORRECT]
# Fit the model with standardized training set
X_scaled = StandardScaler().fit_transform(X_train)
model = LogisticRegression()
model.fit(X_scaled, y_train)
# Test the model with unstandardized test set
y_pred = model.predict_proba(X_test)
# [1.00000000e+000 2.48303123e-204]
print(y_pred.mean(axis=0))
Since the estimator in Example 2 was fitted on scaled data with zero mean and unit variance (X_scaled), the raw values it is being tested on (X_test) are far larger in magnitude than anything it saw during training. It's no surprise, then, that this results in very extreme probabilities.
You can prevent this from happening by wrapping your estimator within a pipeline and calling the pipeline fit method instead of the estimator's fit method (see Example 1). Doing it this way guarantees that the same transformations are applied to the data in the training, validation and testing phases.
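If a pipeline is not an option, the same effect can be had by keeping the fitted scaler and applying its transform to the test set. This is a minimal sketch of that alternative on the same dataset; the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=123)

# Fit the scaler on the training set only, then reuse it everywhere.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# The test set goes through the same fitted scaler before prediction.
X_test_scaled = scaler.transform(X_test)
proba = model.predict_proba(X_test_scaled)
accuracy = model.score(X_test_scaled, y_test)
print(proba.mean(axis=0), accuracy)
```

The pipeline remains the safer choice, because it makes it impossible to forget the transform step.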
I'm trying to create a Naive Bayes Classifier in Python. For finding the accuracy of the classifier, I have train and test data explicitly available, and I want to train my model using train.csv and then test it on test.csv.
Is there a function other than scikit-learn's train_test_split that can help me do that?
Following from comment above:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import mean_squared_error
# Create an instance
nv_clf = GaussianNB()
# Fit on training set
nv_clf.fit(X_train, y_train)
# Predict on X_test
y_pred = nv_clf.predict(X_test)
# Calculate error/accuracy on y_test
nv_mse = mean_squared_error(y_test, y_pred)
# or
nv_rmse = np.sqrt(nv_mse) # root mean squared error
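The snippet above assumes X_train, y_train, X_test and y_test already exist. Since the data is already split into two files, no splitting function is needed: each file can be read separately. Here is a sketch of that step, with tiny in-memory CSVs standing in for train.csv and test.csv and a hypothetical target column named label; it also reports accuracy, which is the more usual metric for a classifier.

```python
import io

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

# Stand-ins for the two files; in practice, pass "train.csv" and
# "test.csv" to pd.read_csv instead. Column names are hypothetical.
train_csv = io.StringIO("f1,f2,label\n1.0,2.0,0\n1.2,1.9,0\n3.0,4.1,1\n3.2,3.9,1\n")
test_csv = io.StringIO("f1,f2,label\n1.1,2.1,0\n3.1,4.0,1\n")

train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

# Separate the features from the target in each set.
X_train, y_train = train_df.drop(columns="label"), train_df["label"]
X_test, y_test = test_df.drop(columns="label"), test_df["label"]

clf = GaussianNB().fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)
```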
I am new to machine learning and I am building my first model independently. I have a dataset that evaluates cars; it contains features for price, safety and luxury, and classifies each car as good, very good, acceptable or unacceptable. I converted all the non-numeric columns into numeric ones, trained the model and predicted on a test set. However, my predictions are awful: I used LinearRegression and r2_score outputs 0.05, which is practically 0. I have tried a few different models and all of them give me horrible predictions and accuracy.
What am I doing wrong? I have seen tutorials, read articles with similar methodology, yet they end up with 0.92 accuracy and I'm getting 0.05. How do you make a good model for your data and how do you know which model to use?
Code:
import numpy as np
import pandas as pd
from sklearn import preprocessing, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import seaborn as sns
import matplotlib.pyplot as plt
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class value']
df = pd.read_csv('car.data.txt', index_col=False, names=columns)
for col in df.columns.values:
    try:
        # Leave the column alone if it already converts cleanly to int
        df[col].astype(int)
    except ValueError:
        enc = preprocessing.LabelEncoder()
        enc.fit(df[col])
        df[col] = enc.transform(df[col])
#Split the data
class_y = df.pop('class value')
x_train, x_test, y_train, y_test = train_test_split(df, class_y, test_size=0.2, random_state=0)
#Make the model
regression_model = linear_model.LinearRegression()
regression_model = regression_model.fit(x_train, y_train)
#Predict the test data
y_pred = regression_model.predict(x_test)
score = r2_score(y_test, y_pred)
You should not use linear regression, which predicts continuous values rather than categorical ones. In your case, the target you are trying to predict is categorical; technically, each outcome is a class.
I would suggest trying logistic regression or other types of classification methods such as Naive Bayes, SVM, decision tree classifiers, etc. instead.
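As an illustration of the suggestion above, here is a sketch that swaps the regressor for a decision tree classifier and scores it with accuracy. The data is a small made-up table with car-style categorical columns, not the real car.data.txt, and the labelling rule is invented purely so the example runs on its own.

```python
import itertools

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Made-up data: every buying/safety combination, repeated 20 times,
# labelled by an invented rule.
levels = ["low", "med", "high"]
rows = [
    {"buying": b, "safety": s,
     "class value": "acc" if s != "low" and b != "high" else "unacc"}
    for b, s in itertools.product(levels, levels)
] * 20
df = pd.DataFrame(rows)

class_y = df.pop("class value")
x_train, x_test, y_train, y_test = train_test_split(
    df, class_y, test_size=0.2, random_state=0, stratify=class_y)

# Encode the string categories as numbers, then fit a classifier.
clf = make_pipeline(OrdinalEncoder(), DecisionTreeClassifier(random_state=0))
clf.fit(x_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(x_test))
print(accuracy)
```

Accuracy, or classification_report for per-class detail, is the appropriate metric here; r2_score is meant for continuous targets.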
I'm using scikit-learn cross-validation (http://scikit-learn.org/stable/modules/cross_validation.html) and get, for example, a mean score of 0.82 (r2 scorer).
How can I tell whether I am over-fitting or under-fitting using scikit-learn functions?
The cross_val_score tool only reports test scores, so on its own it cannot show the gap between train and test performance.
You can set up your own loop with the train_test_split function as in Ando's answer, but you can also use any other CV scheme.
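import numpy as np
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import get_scorer  # replaces the SCORERS dict removed in newer releases
scorer = get_scorer('r2')
cv = KFold(n_splits=5)
train_scores, test_scores = [], []
# regressor, X and y are assumed to be defined already (X and y as NumPy arrays)
for train, test in cv.split(X):
    regressor.fit(X[train], y[train])
    train_scores.append(scorer(regressor, X[train], y[train]))
    test_scores.append(scorer(regressor, X[test], y[test]))
mean_train_score = np.mean(train_scores)
mean_test_score = np.mean(test_scores)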
If you compute the mean train and test scores with cross validation you can then find out if you are:
Underfitting: the train score is far from the perfect score (which is 1.0 for r2)
Overfitting: the train and test scores are not close to one another (the mean test score is significantly lower than the mean train score).
Note: you can be both significantly underfitting and overfitting at the same time if your model is inadequate and your data is too noisy.
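In recent scikit-learn versions, cross_validate with return_train_score=True returns train scores alongside test scores, so the comparison described above can be made without a hand-written loop. The following is a sketch on the diabetes dataset with a plain linear regression, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

X, y = load_diabetes(return_X_y=True)

# Ask for train scores as well as the default test scores.
results = cross_validate(LinearRegression(), X, y, cv=5,
                         scoring="r2", return_train_score=True)
mean_train_score = np.mean(results["train_score"])
mean_test_score = np.mean(results["test_score"])
print(mean_train_score, mean_test_score)
```

A train score well below 1.0 points to underfitting, and a test score well below the train score points to overfitting, as in the criteria listed above.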
You should compare your scores on the training data and on the testing data. If the scores are close to equal and both are low, you are likely underfitting. If they are far apart, you are likely overfitting (unless using a method such as random forest).
To compute the scores for both train and test data, you can use something along the following lines (assuming your data is in variables X and Y):
from sklearn import svm
from sklearn.model_selection import train_test_split
# do five iterations
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4)
    # Your predictor, linear SVM in this example
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    print("Test score", clf.score(X_test, y_test))
    print("Train score", clf.score(X_train, y_train))