LASSO fit scikit-learn - obtain likelihood

I'm using LASSO from the scikit-learn package to optimize the parameters of a penalized linear regression problem. I'm not only interested in the optimal choice of parameters, but also in the likelihood of the data with respect to the optimized parameters. Is there an easy way to get the full likelihood after fitting?

It is somewhat misleading to view the lasso in a maximum likelihood framework. The implied prior on the coefficients is a Laplacian distribution, with density proportional to np.exp(-alpha * np.abs(coef).sum()), which yields sparsity only as an "artifact" at the optimum of the penalized objective. The probability of drawing an exactly sparse sample from this prior is 0 (it happens "almost never").
With this disclaimer out of the way, you can write:
import numpy as np
from sklearn.linear_model import Lasso

est = Lasso(alpha=10.)
est.fit(X, y)  # X, y: your design matrix and targets
coef = est.coef_
# use est.predict so the fitted intercept is accounted for
data_loss = 0.5 * ((est.predict(X) - y) ** 2).sum()
n_samples, n_features = X.shape
# scikit-learn minimizes (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1,
# so the l1 term is rescaled by n_samples to match the data loss above
penalty = n_samples * est.alpha * np.abs(coef).sum()
# unnormalized posterior (likelihood times prior) at the optimum
likelihood = np.exp(-(data_loss + penalty))
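If what you need is the data likelihood on its own, rather than exp(-(objective)), a common choice is a Gaussian noise model on the residuals. A minimal sketch, continuing from the fit above and assuming the noise variance is estimated by its maximum likelihood value (sigma2 is my own name, not a scikit-learn attribute):
resid = y - est.predict(X)
sigma2 = resid.var()  # ML estimate of the noise variance (assumption)
# Gaussian log-likelihood: -n/2 * log(2*pi*sigma^2) - RSS / (2*sigma^2)
log_likelihood = (-0.5 * n_samples * np.log(2 * np.pi * sigma2)
                  - (resid ** 2).sum() / (2 * sigma2))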

Related

Understanding Python's roc_curve, svm example

I'm trying to understand how this Python code works, conceptually, so I can write a paper about it. I have an analogous question for the random forest algorithm; but maybe if I understand this, I'll understand that too. Here's just the part that I think is relevant to my question:
import numpy as np
from numpy import interp
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

stratifiedFolds = StratifiedKFold(n_splits=5, shuffle=True)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1
for train, test in stratifiedFolds.split(x, y):
    svc = SVC(kernel='rbf', C=10000, gamma=0.1)
    x_train, x_test = x[train], x[test]
    y_train, y_test = y[train], y[test]
    svc.fit(x_train, y_train)
    y_pred = svc.decision_function(x_test)
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    i += 1
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
As I understand it, the ROC curve plots the true positive rate against the false positive rate. But each time you run the SVM on the test set, you get a single binary prediction for each test point. You then calculate the true positive rate and false positive rate by tallying true positives and false positives. So tpr should be just a single number, as should fpr, and (fpr, tpr) should be just a single point.
This leads me to expect that to get an ROC curve, one should run the classification algorithm under many different parameter settings. If you're lucky, the algorithm has a parameter such that larger values tend to benefit sensitivity at the expense of specificity, or the other way around. But neither of SVM's parameters (C and gamma) does that. So I would have thought you'd have to try many values of C and gamma until the left, middle and right regions of the ROC curve are all well represented.
But this code looks nothing like that. Only one pair of parameter values (C=10000, gamma=0.1) is ever used, and the SVM is run only once, followed by a call to an interpolation function, within each fold of the 5-fold cross-validation.
My question is: How is it possible to interpolate the roc curve using only 1 point?
The mistake in this reasoning lies in the fact that svc.decision_function(x_test) does not return binary predictions.
It actually returns a signed value, proportional to the distance of each sample in x_test from the separating hyperplane. You can therefore plot a proper ROC curve by sweeping the decision threshold around its default value of 0.
NB: see the reference documentation for details; svc.decision_function returns slightly different formats depending on the decision_function_shape argument of SVC.
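To see concretely that roc_curve sweeps thresholds over the continuous scores, here is a minimal sketch on synthetic data (the dataset and variable names below are illustrative, not from the original post):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_curve

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
svc = SVC(kernel='rbf', C=10000, gamma=0.1).fit(X_tr, y_tr)
scores = svc.decision_function(X_te)  # one continuous score per test sample
fpr, tpr, thresholds = roc_curve(y_te, scores)
print(len(thresholds))  # many (fpr, tpr) points from a single fitted model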

How can this feature ranking problem be implemented with Support Vector Classification?

If I want the classifier to be an SVM (using scikit-learn), how can I modify the clf variable so that the SVM classifier used for feature ranking achieves high accuracy? What parameters/arguments do I need to add? Which SVC kernel ('linear', 'rbf', 'sigmoid' or another) would you suggest for best accuracy?
The code is adapted from the following GitHub notebook:
https://github.com/CynthiaKoopman/Network-Intrusion-Detection/blob/master/RandomForest_IDS.ipynb
I have 10 features, ranked from 1 to 10 with scikit-learn's RFE (recursive feature elimination), from the DoS attacks of the NSL-KDD dataset, using a RandomForestClassifier that reaches 99% accuracy (with the RFC as the prediction model).
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
#from sklearn.svm import SVC
# Create a random forest classifier. clf is the 'variable for classifier'
clf = RandomForestClassifier(n_jobs=2)
# If classifier used is svm
#clf = SVC(kernel = "linear")
# rank all features, i.e. continue the elimination until only one is left
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print("DoS features sorted by their rank:")
sorted_newcolname_DoS = sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS))
sorted_newcolname_DoS
I expected more or less 99% similarity between the ranked features of the two classifiers, which I didn't observe.
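One constraint worth keeping in mind here (a sketch, not part of the original notebook): RFE needs an estimator that exposes coef_ or feature_importances_, so among the SVC kernels only 'linear' works directly; 'rbf' and 'sigmoid' provide no per-feature coefficients to rank with. A minimal substitution, assuming X_newDoS, Y_DoS and newcolname_DoS are defined as above:
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

clf = SVC(kernel="linear")  # linear kernel exposes coef_, which RFE ranks on
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print("DoS features sorted by their rank (linear SVC):")
print(sorted(zip(rfe.ranking_, newcolname_DoS)))
The two models also rank by different criteria (impurity-based importances vs. linear weights), so some disagreement between the rankings is to be expected.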

How do Pseudo-residuals work in a Gradient Boosting Machine (GBM)?

So in a GBM, each new tree is fit to the 'pseudo-residuals' of the ensemble built so far [1].
I'm not sure exactly how these 'pseudo-residuals' work but I wonder how this plays out when you have a combination of:
Binary classification problem
Low response rate
A reasonably low signal-to-noise ratio
In the example below, we have all 3. I calculate residuals as Actual - Probability, and since the response is binary, you end up with a highly bi-modal distribution which is nearly identical to the response itself.
Decreasing the response rate further exacerbates the bi-modality, since the predicted probabilities sit closer to zero and hence the residuals are even closer to either 0 or 1.
So I have a few questions here:
How exactly would pseudo-residuals be calculated in this example? (I am fairly sure my Actual - Probability calculation is wrong, aside from the fact that the initial tree models the difference from the global mean.)
Would the second tree be nearly identical to the first as a result?
Are successive trees in a GBM more similar for problems with lower response rates?
Does down-sampling the non-responses inherently change the model as a result?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
train_percent = 0.8
num_rows = 10000
remove_rate = 0.1
# Generate data
X, y = make_classification(n_samples=num_rows, flip_y=0.55)
# Drop most of the positive rows to unbalance the sample (keeps ~remove_rate of them)
remove = (np.random.random(len(y)) > remove_rate) & (y == 1)
X, y = X[~remove], y[~remove]
print("Response Rate: " + str(sum(y) / float(len(y))))
# Get train/test samples (data is pre-shuffled)
train_rows = int(train_percent * len(X))
X_train , y_train = X[:train_rows], y[:train_rows]
X_test , y_test = X[train_rows:], y[train_rows:]
# Fit a simple decision tree
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(X_train, y_train)
pred = clf.predict_proba(X_test)[:,1]
# Calculate roc auc
roc_auc = roc_auc_score(y_test, pred)
print("ROC AUC: " + str(roc_auc))
# Plot the residuals (actual - predicted probability)
plt.style.use('ggplot')
plt.hist(y_test - pred)
plt.title('Residuals')
plt.show()
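For reference on question 1: with binomial deviance, the pseudo-residual is the negative gradient of the loss with respect to the model's current log-odds output, which works out to y - p. A sketch of the textbook first boosting round, continuing from the training data above (the learning rate and variable names are mine, and the leaf-value Newton step real GBMs apply is omitted):
from sklearn.tree import DecisionTreeRegressor

# F0: initial prediction = log-odds of the base rate
p0 = y_train.mean()
F = np.full(len(y_train), np.log(p0 / (1 - p0)))
# pseudo-residuals for binomial deviance: y - sigmoid(F)
p = 1.0 / (1.0 + np.exp(-F))
pseudo_resid = y_train - p
# each boosting stage fits a *regression* tree to these residuals
tree = DecisionTreeRegressor(max_depth=4).fit(X_train, pseudo_resid)
F = F + 0.1 * tree.predict(X_train)  # learning rate 0.1 (assumption)
Because the update happens on the log-odds scale, the next tree is fit to residuals that have already shifted, so it is not simply a copy of the first even when the residual histogram looks bi-modal.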

taking the gradient of huber loss in theano

I have two functions that are supposed to produce equal results: f1(x, theta) = f2(x, theta).
Given input x, I need to find the parameters theta that make this equality hold as well as possible.
Initially I was thinking of using squared loss, minimizing (f1(x, theta) - f2(x, theta))^2 and solving via SGD.
However, I am now considering making the loss more robust by using the Huber (or absolute) loss of the difference.
Huber loss is a piecewise function (quadratic near zero, linear beyond a threshold delta).
How can I take the gradient of my huber loss in theano?
A pretty simple implementation of huber loss in theano can be found here
Here is a code snippet
import theano.tensor as T

delta = 0.1
def huber(target, output):
    d = target - output
    a = 0.5 * d ** 2
    b = delta * (abs(d) - delta / 2.)
    # quadratic branch where |d| <= delta, linear branch elsewhere
    l = T.switch(abs(d) <= delta, a, b)
    return l.sum()
The function huber returns a symbolic representation of the loss, which you can then plug into theano.tensor.grad to get the gradient and minimize via SGD.
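For example, a minimal sketch of the gradient step (the forms of f1 and f2 below are placeholders for the real functions; only huber and the T.grad call carry over):
import numpy as np
import theano
import theano.tensor as T

x = T.vector("x")
theta = theano.shared(np.array([1.0, 1.0]), name="theta")
f1 = theta[0] * x        # placeholder model
f2 = theta[1] * x ** 2   # placeholder model
loss = huber(f1, f2)
grad = T.grad(loss, theta)  # symbolic gradient of the Huber loss
step = theano.function([x], loss, updates=[(theta, theta - 0.01 * grad)])
for _ in range(100):
    step(np.random.rand(10))  # one SGD step per batch of x values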

Difference in SGD classifier results and statsmodels results for logistic with l1

As a check on my work, I've been comparing the output of scikit-learn's SGDClassifier logistic implementation with statsmodels' logistic regression. Once I add some l1 penalty in combination with categorical variables, I get very different results. Is this a result of different solution techniques, or am I not using the correct parameters?
Much bigger differences on my own dataset, but still pretty large using mtcars:
import numpy as np
import patsy
import statsmodels.api as sm
from sklearn.linear_model import SGDClassifier

df = sm.datasets.get_rdataset("mtcars", "datasets").data
y, X = patsy.dmatrices('am~standardize(wt) + standardize(disp) + C(cyl) - 1', df)
logit = sm.Logit(y, X).fit_regularized(alpha=.0035)
clf = SGDClassifier(alpha=.0035, penalty='l1', loss='log', l1_ratio=1,
                    n_iter=1000, fit_intercept=False)  # n_iter is max_iter in newer scikit-learn
clf.fit(X, np.ravel(y))  # SGDClassifier expects a 1-d target
gives:
sklearn: [-3.79663192 -1.16145654 0.95744308 -5.90284803 -0.67666106]
statsmodels: [-7.28440744 -2.53098894 3.33574042 -7.50604097 -3.15087396]
I've been working through some similar issues. I think the short answer might be that SGD doesn't work so well with only a few samples, but is (much more) performant with larger data. I'd be interested in hearing from the sklearn devs. Compare, for example, using LogisticRegression:
from sklearn.linear_model import LogisticRegression

# note: newer scikit-learn needs solver='liblinear' (or 'saga') for an l1 penalty
clf2 = LogisticRegression(penalty='l1', C=1/.0035, fit_intercept=False)
clf2.fit(X, np.ravel(y))
gives coefficients very similar to the l1-penalized Logit:
array([[-7.27275526, -2.52638167, 3.32801895, -7.50119041, -3.14198402]])
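Part of the remaining gap may also be a parametrization difference, at least on my reading of the docs: SGDClassifier penalizes the mean loss over the n samples with alpha * ||w||_1, while LogisticRegression puts C on the summed loss term, so the rough correspondence is alpha = 1 / (C * n_samples). A sketch of the rescaled comparison:
n_samples = X.shape[0]  # 32 rows in mtcars
# alpha = 1 / (C * n_samples), with C = 1/.0035 as above
clf3 = SGDClassifier(alpha=.0035 / n_samples, penalty='l1', loss='log',
                     l1_ratio=1, n_iter=1000, fit_intercept=False)
clf3.fit(X, np.ravel(y))
print(clf3.coef_)  # with this scaling the solutions should be more comparable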
