How do Pseudo-residuals work in a Gradient Boosting Machine (GBM)? - statistics

So in a GBM, each tree predicts the 'pseudo-residuals' of the prior tree [1].
I'm not sure exactly how these 'pseudo-residuals' work but I wonder how this plays out when you have a combination of:
Binary classification problem
Low response rate
A reasonably low signal-to-noise ratio
In the example below, we have all 3. I calculate residuals as Actual - Probability and since the response is binary, you end up with this highly bi-modal distribution which is nearly identical to the response.
Decreasing the response rate further exacerbates the bi-modal distribution, since the predicted probabilities move closer to zero and, hence, the residuals move even closer to either 0 or 1.
So I have a few questions here:
How exactly would pseudo residuals be calculated in this example? (I am fairly sure this is wrong, aside from just the fact that the initial tree models difference from the global mean)
Would the second tree be nearly identical to the first as a result?
Are successive trees in a GBM more similar for problems with lower response rates?
Does down-sampling on non-response inherently change the model as a result?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
train_percent = 0.8
num_rows = 10000
remove_rate = 0.1
# Generate data
X, y = make_classification(n_samples=num_rows, flip_y=0.55)
# Remove response rows to make sample unbalanced
remove = (np.random.random(len(y)) > remove_rate) & (y == 1)
X, y = X[~remove], y[~remove]
print("Response Rate: " + str(sum(y) / float(len(y))))
# Get train/test samples (data is pre-shuffled)
train_rows = int(train_percent * len(X))
X_train , y_train = X[:train_rows], y[:train_rows]
X_test , y_test = X[train_rows:], y[train_rows:]
# Fit a simple decision tree
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:,1]
# Calculate roc auc
roc_auc = roc_auc_score(y_test, pred)
print("ROC AUC: " + str(roc_auc))
# Plot residuals
plt.style.use('ggplot')
plt.hist(y_test - pred);
plt.title('Residuals')
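For reference, here is a minimal sketch of how pseudo-residuals are usually defined for a binary target under log-loss: each stage fits a regression tree (not a classifier) to the negative gradient of the loss with respect to the current log-odds, which works out to y - p. The data set, tree depth, learning rate and number of stages below are illustrative assumptions, and library implementations typically also adjust the leaf values, which this sketch skips.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(y_true, p):
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


# Hypothetical unbalanced data set (roughly 10% positives).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)

# Stage 0: a constant model, the log-odds of the overall response rate.
base_rate = y.mean()
F = np.full(len(y), np.log(base_rate / (1 - base_rate)))

learning_rate = 0.1  # assumed shrinkage value
losses = [log_loss(y, sigmoid(F))]
first_residuals = None
for stage in range(20):
    # Pseudo-residuals: negative gradient of log-loss w.r.t. the log-odds F.
    residuals = y - sigmoid(F)
    if first_residuals is None:
        first_residuals = residuals.copy()
    # Each stage is a regression tree fitted to the pseudo-residuals.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)
    F += learning_rate * tree.predict(X)
    losses.append(log_loss(y, sigmoid(F)))

print("distinct first-stage residuals:", np.unique(np.round(first_residuals, 12)))
print("log loss before/after:", losses[0], losses[-1])
```

In this sketch the first-stage residuals take only two values, 1 - base_rate and -base_rate, which is the bi-modal shape described above. Later stages see different residuals because the predicted probabilities change after every update.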

Related

Understanding Python's roc_curve, svm example

I'm trying to understand how this Python code works, conceptually, so I can write a paper about it. I have an analogous question for the random forest algorithm; but maybe if I understand this, I'll understand that too. Here's just the part that I think is relevant to my question:
import numpy as np
from numpy import interp
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
statifiedFolds = StratifiedKFold(n_splits=5, shuffle=True)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
for train, test in statifiedFolds.split(x, y):
    svc = SVC(kernel='rbf', C=10000, gamma=0.1)
    x_train, x_test = x[train], x[test]
    y_train, y_test = y[train], y[test]
    svc.fit(x_train, y_train)
    # continuous scores, one per test sample
    y_pred = svc.decision_function(x_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    # resample this fold's curve onto the common FPR grid
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
As I understand, the ROC curve plots false positive rate against true positive rate. But each time you run SVM on the testing set, you get a single binary prediction for each testing point. You then calculate the true positive rate and false positive rate by tallying true positives and false positives. So tpr should be just a single number, as should fpr. Thus (tpr,fpr) should be just a single point.
This leads me to expect that to get an roc curve, one should run the classification algorithm under many different parameters. If you're lucky, the algorithm will have a parameter such that larger values tends to benefit sensitivity at the expense of specificity, or the other way around. But neither of SVM's parameters (C and gamma) do that. So I would have thought you'd have to try many values of C and gamma until the left, middle and right regions of the roc curve are all well-represented.
But this code looks nothing like that. Only one pair of parameter values (C=10000, gamma = 0.1) ever gets called. And svm is run only once, followed by a call of an interpolation function, within each fold of the 5-fold cross-validation.
My question is: How is it possible to interpolate the roc curve using only 1 point?
The mistake in this reasoning lies in the fact that svc.decision_function(x_test) is not returning binary data.
It's actually returning a (signed) value, proportional to the distance of the samples X to the separating hyperplane. You can therefore plot a proper roc curve by adjusting the threshold around the default value of 0.
NB: See the reference documentation for details, svc.decision_function will return slightly different formats depending on the decision_function_shape argument of svc.
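A small sketch of the point made in this answer, on a synthetic data set (the data and the SVC settings are assumptions, not the asker's): decision_function yields one continuous score per sample, so roc_curve can sweep a threshold over them, whereas the hard labels from predict give at most one interior point.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical data; any binary problem would do.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

scores = svc.decision_function(X_test)  # signed, continuous values
labels = svc.predict(X_test)            # hard 0/1 decisions

# ROC from continuous scores: one point per distinct threshold kept.
fpr, tpr, thresholds = roc_curve(y_test, scores)
# ROC from hard labels: only the two end points plus one interior point.
fpr_b, tpr_b, _ = roc_curve(y_test, labels)

print("distinct scores:", len(np.unique(scores)))
print("ROC points from scores:", len(fpr))
print("ROC points from labels:", len(fpr_b))
```

For a binary SVC, predict is simply the sign of the decision function, which is why thresholding the scores at 0 reproduces the single point the question expected.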

Confidence score too low

I'm wondering why the model score is so low, only 0.13. I have already made sure the data is clean and scaled, and the features are highly correlated with each other, but the score from linear regression is still very low. Why is this happening, and how can I fix it? This is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing
path = r"D:\python projects\avocado.csv"
df = pd.read_csv(path)
df = df.reset_index(drop=True)
df.set_index('Date', inplace=True)
df = df.drop(['Unnamed: 0','year','type','region','AveragePrice'], axis=1)
df.rename(columns={'4046':'Small HASS sold',
                   '4225':'Large HASS sold',
                   '4770':'XLarge HASS sold'},
          inplace=True)
print(df.head())
sns.heatmap(df.corr())
sns.pairplot(df)
df.plot()
_=plt.xticks(rotation=20)
forecast_line = 35
df['target'] = df['Total Volume'].shift(-forecast_line)
X = np.array(df.drop(['target'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_line:]
X = X[:-forecast_line]
df.dropna(inplace=True)
y = np.array(df['target'])
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
lr = LinearRegression()
lr.fit(X_train,y_train)
confidence = lr.score(X_test,y_test)
print(confidence)
This is the dataset I use: https://www.kaggle.com/neuromusic/avocado-prices
So the score function you are using is:
Return the coefficient of determination R^2 of the prediction.
The coefficient R^2 is defined as (1 - u/v), where u is the residual
sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum
of squares ((y_true - y_true.mean()) ** 2).sum(). The best possible
score is 1.0 and it can be negative (because the model can be
arbitrarily worse). A constant model that always predicts the expected
value of y, disregarding the input features, would get a R^2 score of
0.0.
So, as you can see, you are already doing better than the constant prediction.
My advice: plot your data to see what kind of regression you should use. An overview of the available linear models is here: https://scikit-learn.org/stable/modules/linear_model.html
Logistic regression makes sense if your data follows a logistic curve, meaning that the target values are mostly close to 0 or to 1, with few points in between.
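The definition quoted above can be checked by hand. The sketch below uses made-up data (the coefficients and noise level are assumptions) to compute 1 - u/v directly and compare it with lr.score, and to confirm that a constant prediction at the mean scores 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=200)

lr = LinearRegression().fit(X, y)
y_pred = lr.predict(X)

u = ((y - y_pred) ** 2).sum()    # residual sum of squares
v = ((y - y.mean()) ** 2).sum()  # total sum of squares
r2_manual = 1 - u / v

# A model that always predicts the mean of y.
constant_pred = np.full_like(y, y.mean())
r2_constant = r2_score(y, constant_pred)

print("manual R^2:  ", r2_manual)
print("lr.score:    ", lr.score(X, y))
print("constant R^2:", r2_constant)
```

A score of 0.13 therefore means the model removes about 13% of the squared error that the mean-only prediction leaves.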

scikit-learn LogisticRegressionCV: best coefficients

I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation, where the "refit" parameter is True.
If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. Then, the best coefficients are simply the coefficients that were calculated on the fold that has the highest score for the best C. I assume that if the maximum score is achieved by several folds, the coefficients of these folds would be averaged to give the best coefficients (I didn't see anything on how this case is handled in the docs).
To test my understanding, I determined the best coefficients in two different ways:
1. directly from the coef_ attribute of the fitted model, and
2. from the coefs_paths attribute, which contains the path of the coefficients obtained during cross-validating across each fold and then across each C.
The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here.
Thanks!
An example to demonstrate the issue:
from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)
# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1',
                           refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)
########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")
########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
paths = clf.coefs_paths_[1] # has shape (n_folds, len(C_values), n_features)
coefs2 = np.squeeze(paths[best_folds_idx, best_C_idx, :])
coefs2 = np.mean(coefs2, axis=0)
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
I think this article answers your question: https://orvindemsy.medium.com/understanding-grid-search-randomized-cvs-refit-true-120d783a5e94.
The key point is the refit parameter of LogisticRegressionCV.
According to sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html)
refit : bool, default=True
If set to True, the scores are averaged across all folds, and the coefs and the C that corresponds to the best score is taken, and a final refit is done using these parameters. Otherwise the coefs, intercepts and C that correspond to the best scores across folds are averaged.
Best.
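A rough way to see what the final refit means: with refit=True, coef_ should come from one more fit on the whole training set at the selected C, not from any single fold. The sketch below compares it with a plain LogisticRegression fitted at that C. The C grid, the default L2 penalty and max_iter are assumptions chosen to keep the example short, and the two fits should agree only up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
C_values = [0.01, 0.1, 1.0]  # assumed grid, smaller than in the question

# Cross-validate over C, then refit on all of X at the best C.
cv_model = LogisticRegressionCV(Cs=C_values, cv=5, refit=True,
                                max_iter=5000).fit(X, y)
best_C = float(np.ravel(cv_model.C_)[0])

# Manual equivalent of the final step: one fit on all data at best_C.
manual = LogisticRegression(C=best_C, max_iter=5000).fit(X, y)

max_diff = np.abs(cv_model.coef_ - manual.coef_).max()
print("selected C:", best_C)
print("largest coefficient difference vs. manual refit:", max_diff)
```

Coefficients taken from coefs_paths_ for a single fold were estimated on only part of the data, so they are not expected to match coef_ exactly.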

Shouldn't a SVM binary classifier understand the threshold from the training set?

I'm very confused about SVM classifiers and I'm sorry if I'll sound stupid.
I'm using the Spark library for java http://spark.apache.org/docs/latest/mllib-linear-methods.html, the first example from the Linear Support Vector Machines paragraph. On this training set:
1 1:10
1 1:9
1 1:9
1 1:9
0 1:1
1 1:8
1 1:8
0 1:2
0 1:2
0 1:3
the predictions for the values 8, 2 and 1 are all positive (1). Given the training set, I would expect them to be positive, negative, negative. It gives negative only on 0 or negative values. I read that the standard threshold is "positive" if the prediction is a positive double and "negative" if it's negative, and I've seen that there is a method to set the threshold manually. But isn't this exactly what I need a binary classifier for? I mean, if I know in advance what the threshold is, I can already distinguish between positive and negative values, so why bother training a classifier?
UPDATE:
Using this python code from a different library:
X = [[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]]
y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

# convert the lists of lists to numpy arrays
X = np.array(X)
y = np.array(y)
# per-fold accuracy of the classifier
accuracy = []

# 5-fold stratified cross-validation, so that every sample is tested once
kf_total = StratifiedKFold(n_splits=5, shuffle=True)
for train, test in kf_total.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    print(X_train)
    clf = SVC().fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("the classifier says: ", y_pred)
    print("reality is: ", y_test)
    print(accuracy_score(y_test, y_pred))
    print("")
    accuracy.append(accuracy_score(y_test, y_pred))
print(sum(accuracy) / len(accuracy))
the results are correct:
######
1 [0]
######
2 [0]
######
8 [1]
So I think it's possible for a SVM classifier to understand the threshold by itself; how can I do the same with the spark library?
SOLVED: I solved the issue changing the example to this:
SVMWithSGD std = new SVMWithSGD();
std.setIntercept(true);
final SVMModel model = std.run(training.rdd());
From this:
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations);
The standard value for "intercept" is false, which is what I needed to be true.
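The effect of the intercept can be reproduced on the same ten points. The sketch below is a scikit-learn analogue using LinearSVC, not the Spark API: without an intercept the decision value is w * x, which has the same sign for every positive x, so no threshold between 3 and 8 can be learned. The max_iter and random_state values are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

# The training set from the question: one feature, label 1 for large values.
X = np.array([[10], [9], [9], [9], [1], [8], [8], [2], [2], [3]], dtype=float)
y = np.array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0])
X_new = np.array([[8.0], [2.0], [1.0]])

# Without an intercept the boundary is forced through x = 0.
no_intercept = LinearSVC(fit_intercept=False, max_iter=10000,
                         random_state=0).fit(X, y)
# With an intercept the boundary can sit between the two groups.
with_intercept = LinearSVC(fit_intercept=True, max_iter=10000,
                           random_state=0).fit(X, y)

pred_no = no_intercept.predict(X_new)
pred_with = with_intercept.predict(X_new)

print("without intercept:", pred_no)
print("with intercept:   ", pred_with)
```

So the threshold is learned from the training data in both libraries, but only when the model has a bias term that lets it move the boundary away from the origin.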
If you search for probability calibration you will find some research on a related matter (recalibrating the outputs to return better scores).
If your problem is a binary classification problem, you can calculate the slope of the cost line by assigning values to the true/false positive/negative outcomes and multiplying by the class ratio. You can then find the line with that slope that touches the ROC curve at exactly one point; that point is, in some sense, an optimal threshold for your problem.
The threshold is the single value that separates the classes.

LASSO fit scikit-learn - obtain likelihood

I'm using LASSO from scikit-learn package to optimize the parameters of a penalized linear regression problem. I'm not only interested in the optimal choice of parameters, but also in the likelihood of the data with respect to the optimized parameters. Is there an easy way to get the full likelihood after fitting?
It is slightly misleading to consider the lasso in a maximum likelihood framework. The prior distribution on the coefficients is then a Laplace distribution, proportional to exp(-np.sum(np.abs(coef))), which yields sparsity only as an "artifact" at its optimum. The probability of obtaining a sparse sample from this distribution is 0 (it happens "almost never").
This disclaimer out of the way, you can write
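import numpy as np
from sklearn.linear_model import Lasso
est = Lasso(alpha=10.)
est.fit(X, y)
coef = est.coef_
# est.predict includes the fitted intercept (Lasso fits one by default)
data_loss = 0.5 * ((est.predict(X) - y) ** 2).sum()
n_samples, n_features = X.shape
# scikit-learn divides the squared error by 2 * n_samples, hence the factor n_samples
penalty = n_samples * est.alpha * np.abs(coef).sum()
# unnormalized, and assumes unit noise variance
likelihood = np.exp(-(data_loss + penalty))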
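As a sanity check on the expression above, the sketch below evaluates the same quantity in log space on synthetic data (the data set, alpha=1.0 and fit_intercept=False are assumptions made to keep it short). The fitted coefficients should give a smaller value of data_loss + penalty than a perturbed copy, and working with the logarithm avoids np.exp underflowing to 0 when the loss is large.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Hypothetical regression problem.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# No intercept here, so the prediction is simply X @ coef.
est = Lasso(alpha=1.0, fit_intercept=False).fit(X, y)


def neg_log_value(coef):
    """data_loss + penalty, i.e. n_samples times the Lasso objective."""
    data_loss = 0.5 * ((X @ coef - y) ** 2).sum()
    penalty = X.shape[0] * est.alpha * np.abs(coef).sum()
    return data_loss + penalty


fitted = neg_log_value(est.coef_)
perturbed = neg_log_value(est.coef_ + 0.5)  # arbitrary shift for comparison

log_value = -fitted  # log of the unnormalized quantity
print("log value at the fitted coefficients:", log_value)
print("log value at perturbed coefficients: ", -perturbed)
```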
