Negative R2 on training data for linear regression - scikit-learn

Using scikit-learn to fit a one dimensional model, without an intercept:
import sklearn.linear_model
lm = sklearn.linear_model.LinearRegression(fit_intercept=False)
lm.fit(x, y)
When evaluating the score using the training data I get a negative .score().
lm.score(x, y)
-0.00256
Why? Does the R2 score compare the variance of my intercept-less model with a model with an intercept?
(Note that it is the same data that I used to fit the model.)

From Wikipedia article on R^2:
Important cases where the computational definition of R2 can yield
negative values, depending on the definition used, arise [...] where
linear regression is conducted without including an intercept.
(emphasis mine).
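A minimal, dependency-free sketch of the effect described in the quote. It assumes R2 is computed as 1 - SS_res / SS_tot, with SS_tot measured around the mean of y, which is the definition the scikit-learn documentation gives for .score(). The data and the helper names fit_no_intercept and r_squared are made up for illustration.

```python
def fit_no_intercept(xs, ys):
    # Least-squares slope for the model y = slope * x (no intercept term).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def r_squared(y_true, y_pred):
    # R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y.
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative data: y hovers around 10 regardless of x, so a line forced
# through the origin fits worse than the constant prediction mean(y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 9.0, 11.0, 10.0]

slope = fit_no_intercept(xs, ys)
r2 = r_squared(ys, [slope * x for x in xs])
print(slope, r2)  # r2 is negative even though xs, ys are the training data
```

The baseline in SS_tot is the mean of y, which is effectively an intercept-only model. A model without an intercept cannot reproduce that baseline, so it can lose to it even on its own training data.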

Related

Evaluating Gaussian Mixture model using a score metric?

I have 1D data (one column of data). I used a Gaussian Mixture Model (GMM) as a density estimator, using this Python implementation: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. Relying on the AIC/BIC criteria, I was able to determine the number of components. After fitting the GMM, I plotted the kernel density estimate of the original observations alongside that of sampled data drawn from the GMM. The original and sampled densities are quite similar, which is good. But I would like some metric to report how good the fitted model is.
from sklearn.mixture import GaussianMixture
g = GaussianMixture(n_components=35)
data = df['x'].values.reshape(-1, 1)  # data taken from the data frame (10,000 data points)
clf = g.fit(data)  # fit the model
samples = clf.sample(10000)[0]  # generate sample data points (same number as the original data points)
I found score in the implementation, but I am not sure how to use it. Am I doing it wrong? Or is there a better way to show how accurate the fitted model is, apart from histogram or kernel density plots?
print(clf.score(data))
print(clf.score(samples))
You can use normalized_mutual_info_score, adjusted_rand_score or the silhouette score to evaluate your clusters. All of these metrics are implemented in the sklearn.metrics module.
EDIT: You can check this link for more detailed explanations.
In summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
from sklearn.metrics import silhouette_score
gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print("gmm: silhouette: ", silhouette_score(x_vec, pred))
I would rather use cross-validation and check the accuracy of the trained model.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the accuracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components=35)
g.fit(train_data)  # fit the model
y_pred = g.predict(test_data)
EDIT:
There are several options to measure performance in your unsupervised case. For a GMM, which is based on real probabilities, the most common are BIC and AIC. Both are directly available in the scikit-learn GaussianMixture class.
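To make the score, BIC and AIC options concrete, here is a hand-rolled sketch for the 1D case. It assumes that score reports the per-sample average log-likelihood (as the scikit-learn documentation describes) and that a 1D mixture with K components has 3K - 1 free parameters. The mixture parameters below are hand-picked for illustration rather than fitted, and the function names are hypothetical, not part of scikit-learn.

```python
import math

def gmm_avg_log_likelihood(data, weights, means, stds):
    # Per-sample average log-likelihood of 1D data under a Gaussian mixture.
    total = 0.0
    for x in data:
        density = sum(
            w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, stds)
        )
        total += math.log(density)
    return total / len(data)

def gmm_bic_aic(avg_log_lik, n_samples, n_components):
    # 1D mixture: K means + K variances + (K - 1) free weights = 3K - 1 parameters.
    n_params = 3 * n_components - 1
    log_lik = avg_log_lik * n_samples
    bic = -2.0 * log_lik + n_params * math.log(n_samples)
    aic = -2.0 * log_lik + 2.0 * n_params
    return bic, aic

# Illustrative data with two clear clusters.
data = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]

# Hand-picked (not fitted) parameters for a two-component and a one-component model.
avg_ll_two = gmm_avg_log_likelihood(data, [0.5, 0.5], [-2.0, 2.0], [0.2, 0.2])
avg_ll_one = gmm_avg_log_likelihood(data, [1.0], [0.0], [2.0])

bic_two, aic_two = gmm_bic_aic(avg_ll_two, len(data), 2)
bic_one, aic_one = gmm_bic_aic(avg_ll_one, len(data), 1)
print(avg_ll_two, avg_ll_one)
print(bic_two, bic_one)
```

A higher average log-likelihood means the model assigns more density to the data, while lower BIC/AIC is better. Evaluating the log-likelihood on held-out data instead of the training data gives a less optimistic picture of the fit.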

r2_score is -18.709, Why?

I'm conducting multiple linear regression in Python. To the best of my knowledge, r2_score is supposed to be in the range of -1 to 1. But I obtained -18.709.
What causes this result and how can I correct it? The code and its result look as follows:
# calculate R^2
from sklearn.metrics import r2_score
score = r2_score(y_test, y_pred)
print(score)
The output:
-18.7097
The predictions are as follows:
y_pred = model.predict(X_test)
print(y_pred)
Result:
[ 25000. 123000. 73000. 103000.]
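The coefficient of determination R^2 is defined as
R^2 = 1 - SS_res / SS_tot, where SS_res = sum_i (y_i - f_i)^2 is the residual sum of squares of the predictions f_i and SS_tot = sum_i (y_i - mean(y))^2 is the total sum of squares.
It is also known as the Nash–Sutcliffe model efficiency coefficient (explanation below).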
There are cases where the computational definition of R2 can yield
negative values, depending on the definition used. This can arise when
the predictions that are being compared to the corresponding outcomes
have not been derived from a model-fitting procedure using those data.
Even if a model-fitting procedure has been used, R2 may still be
negative, for example when linear regression is conducted without
including an intercept, or when a non-linear function is used to fit
the data. In cases where negative values arise, the mean of the data
provides a better fit to the outcomes than do the fitted function
values, according to this particular criterion. Since the most general
definition of the coefficient of determination is also known as the
Nash–Sutcliffe model efficiency coefficient, this last notation is
preferred in many fields, because denoting a goodness-of-fit indicator
that can vary from −∞ to 1 (i.e., it can yield negative values) with a
squared letter is confusing.
Source: Wikipedia
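A small sketch of why the score is not bounded below by -1. The y_pred values are the ones printed in the question; the y_test values are hypothetical, since the question does not show them, and coefficient_of_determination is an illustrative helper rather than the scikit-learn function.

```python
def coefficient_of_determination(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares around the mean).
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_pred = [25000.0, 123000.0, 73000.0, 103000.0]  # predictions from the question
y_test = [60000.0, 80000.0, 70000.0, 90000.0]    # hypothetical targets (assumption)

score = coefficient_of_determination(y_test, y_pred)

# Baseline: always predicting the mean of y_test scores exactly 0.
mean_baseline = [sum(y_test) / len(y_test)] * len(y_test)
baseline_score = coefficient_of_determination(y_test, mean_baseline)
print(score, baseline_score)
```

Any score below 0 means the predictions have a larger squared error than the constant mean prediction. The further the predictions are from the targets relative to the spread of the targets, the more negative the value becomes.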

Interpreting coefficientMatrix, interceptVector and Confusion matrix on multinomial logistic regression

Can anyone explain how to interpret the coefficientMatrix, interceptVector and confusion matrix
of a multinomial logistic regression?
According to Spark documentation:
Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression, the algorithm produces K sets of coefficients, or a matrix of dimension K×J where K is the number of outcome classes and J is the number of features. If the algorithm is fit with an intercept term then a length K vector of intercepts is available.
I ran an example using Spark ML 2.3.0 and got this result.
If I analyse what I get:
The coefficientMatrix has dimension 5 x 11.
The interceptVector has dimension 5.
If so, why does the confusion matrix have dimension 4 x 4?
Also, can anyone give an interpretation of the coefficientMatrix and interceptVector?
Why do I get negative coefficients?
If 5 is the number of classes after classification, why do I get 4 rows in the confusion matrix?
EDIT
I forgot to mention that I am still a beginner in machine learning and that searching Google didn't help, so maybe I get an upvote :)
Regarding the 4x4 confusion matrix: I imagine that when you split your data into test and train, there were 5 classes present in your training set and only 4 classes present in your test set. This can easily happen if the distribution of your response variable is imbalanced.
You'll want to try to perform some stratified split between test and train prior to modeling. If you are working with pyspark, you may find this library helpful: https://github.com/databricks/spark-sklearn
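The idea of a stratified split can be sketched in plain Python as follows. This is only an illustration of the principle, not the API of Spark or of the library linked above; stratified_split and the label counts are made up.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    # Split indices class by class so every class appears in the test set.
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take at least one example per class for the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Imbalanced illustrative labels: class 4 has only two examples.
labels = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 8 + [4] * 2

train_idx, test_idx = stratified_split(labels)
test_classes = {labels[i] for i in test_idx}
print(sorted(test_classes))
```

With a purely random split, a class with only a couple of examples can easily end up entirely in the training set, which would produce a confusion matrix with fewer rows than there are classes. Note that a class with a single example would land only in the test set under this sketch.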
Now regarding negative coefficients for a multi-class Logistic Regression: As you mentioned, your returned coefficientMatrix shape is 5x11.
Spark generated five models via a one-vs-all approach. The 1st model corresponds to the model where the positive class is the 1st label and the negative class is composed of all other labels. Let's say the 1st coefficient for this model is -2.23. To interpret this coefficient, we take the exponential of -2.23, which is approximately 0.11. Interpretation: 'With a one-unit increase in the 1st feature, we expect the odds of the positive label to be reduced by roughly 90%.'
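The arithmetic behind that interpretation, as a short sketch. It assumes the binary (one class versus the rest) reading of the coefficient used in this answer; the value -2.23 is the example value from the text.

```python
import math

coefficient = -2.23  # example coefficient from the answer above

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in the feature, all else held fixed.
odds_ratio = math.exp(coefficient)
percent_change = (odds_ratio - 1.0) * 100.0

print(f"odds ratio: {odds_ratio:.4f}")
print(f"change in odds per unit increase: {percent_change:.1f}%")
```

A negative coefficient therefore gives an odds ratio below 1 (the feature lowers the odds of that class), a positive one gives an odds ratio above 1, and a coefficient of 0 leaves the odds unchanged.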

Modelling probabilities in a regularized (logistic?) regression model in python

I would like to fit a regression model to probabilities. I am aware that linear regression is often used for this purpose, but I have several probabilities at or near 0.0 and 1.0 and would like to fit a regression model where the output is constrained to lie between 0.0 and 1.0. I want to be able to specify a regularization norm and strength for the model and ideally do this in python (but an R implementation would be helpful as well). All the logistic regression packages I've found seem to be only suited for classification whereas this is a regression problem (albeit one where I want to use the logit link function). I use scikits-learn for my classification and regression needs so if this regression model can be implemented in scikits-learn, that would be fantastic (it seemed to me that this is not possible), but I'd be happy about any solution in python and/or R.
The question has two issues, penalized estimation and fractional or proportions data as dependent variable. I worked on each separately but never tried the combination.
Penalization
Statsmodels has had L1 regularized Logit and other discrete models like Poisson for some time. In recent months there has been a lot of effort to support more penalization but it is not in statsmodels yet. Elastic net for linear and Generalized Linear Model (GLM) is in a pull request and will be merged soon. More penalized GLM like L2 penalization for GAM and splines or SCAD penalization will follow over the next months based on pull requests that still need work.
Two examples of the current L1 fit_regularized for Logit are the question "Difference in SGD classifier results and statsmodels results for logistic with l1" and https://github.com/statsmodels/statsmodels/blob/master/statsmodels/examples/l1_demo/short_demo.py
Note, the penalization weight alpha can be a vector with zeros for coefficients like the constant if they should not be penalized.
http://www.statsmodels.org/dev/generated/statsmodels.discrete.discrete_model.Logit.fit_regularized.html
Fractional models
Binary and binomial models in statsmodels do not impose that the dependent variable is binary and work as long as the dependent variable is in the [0,1] interval.
Fractions or proportions can be estimated with Logit as Quasi-maximum likelihood estimator. The estimates are consistent if the mean function, logistic, cumulative normal or similar link function, is correctly specified but we should use robust sandwich covariance for proper inference. Robust standard errors can be obtained in statsmodels through a fit keyword cov_type='HC0'.
The best documentation is for Stata http://www.stata.com/manuals14/rfracreg.pdf and the references therein. I went through those references before Stata had fracreg, and it works correctly with at least Logit and Probit, which were my test cases. (I can't find my scripts or test cases right now.)
The bad news for inference is that robust covariance matrices have not been added to fit_regularized, so the correct sandwich covariance is not directly available. The standard covariance matrix and standard errors of the parameter estimates are derived under the assumption that the model, i.e. the likelihood function, is correctly specified, which will not be the case if the data are fractions and not binary.
Besides using Quasi-Maximum Likelihood with binary models, it is also possible to use a likelihood that is defined for fractional data in (0, 1). A popular model is Beta regression, which is also waiting in a pull request for statsmodels and is expected to be merged within the next months.

Regarding Probability Estimates predicted by LIBSVM

I am attempting 3-class classification using an SVM classifier. How do we interpret the probability estimates predicted by LIBSVM? Are they based on the perpendicular distance of the instance from the maximal margin hyperplane?
Kindly throw some light on the interpretation of the probability estimates predicted by the LIBSVM classifier. Parameters C and gamma are tuned first, and then probability estimates are output by using the -b option for both training and testing.
Multiclass SVM is always decomposed into several binary classifiers (typically a set of one vs all classifiers). Any binary SVM classifier's decision function outputs a (signed) distance to the separating hyperplane. In short, an SVM maps the input domain to a one-dimensional real number (the decision value). The predicted label is determined by the sign of the decision value. The most common technique to obtain probabilistic output from SVM models is through so-called Platt scaling (paper of LIBSVM authors).
Is it based on perpendicular distance of the instance from the maximal margin hyperplane?
Yes. Any classifier that outputs such a one-dimensional real value can be post-processed to yield probabilities, by calibrating a logistic function on the decision values of the classifier. This is the exact same approach as in standard logistic regression.
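The calibration step described above can be sketched as follows, using the sigmoid form P(y=1 | f) = 1 / (1 + exp(A*f + B)) from Platt's method. This is a simplified illustration with plain gradient descent on the log-loss: it omits the smoothed targets Platt proposed and the way LIBSVM combines binary estimates in the multiclass case. The decision values and labels are made up.

```python
import math

def platt_probability(decision_value, a, b):
    # Platt's sigmoid: maps an SVM decision value to a probability.
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

def fit_platt(decision_values, labels, lr=0.5, n_iter=2000):
    # Fit A and B by gradient descent on the mean log-loss.
    # With z = A*f + B and p = 1 / (1 + exp(z)), d(loss)/dz = (t - p).
    a, b = 0.0, 0.0
    n = len(decision_values)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for f, t in zip(decision_values, labels):
            p = platt_probability(f, a, b)
            grad_a += (t - p) * f
            grad_b += (t - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Illustrative signed decision values with slightly overlapping classes.
decision_values = [-2.0, -1.2, -0.4, 0.3, -0.1, 0.5, 1.1, 2.3]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

a, b = fit_platt(decision_values, labels)
print(a, b)
print(platt_probability(-2.0, a, b), platt_probability(2.0, a, b))
```

The fitted A comes out negative, so larger (more positive) decision values map to probabilities closer to 1. In practice the sigmoid should be fitted on decision values from held-out or cross-validated data rather than on the training set itself.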
SVM performs binary classification. To achieve multiclass classification, libsvm performs what is called one vs all. What you get when you invoke -b is the probability related to this technique, which you can find explained here.

Resources