My basic understanding of a residuals plot was that it shows the (standardized) residuals versus the fitted (predicted) values. But a Google search led me to a few sites stating that it is the residuals versus the independent variable, for example:
[http://stattrek.com/regression/residual-analysis.aspx]
And here is a site that describes my initial understanding:
[http://blog.minitab.com/blog/adventures-in-statistics/why-you-need-to-check-your-residual-plots-for-regression-analysis]
Which is the statistically correct method?
I am certain the residuals plot is of the residuals versus the predicted values; Excel's Regression tool also generates this plot automatically in its output.
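For illustration, here is a minimal sketch of a residuals-vs-fitted plot with statsmodels and matplotlib (the data is simulated, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data, purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Fit OLS, then plot residuals against the fitted (predicted) values
model = sm.OLS(y, sm.add_constant(x)).fit()
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()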
The Python module statsmodels supports generalized linear models (GLM), as described at https://www.statsmodels.org/stable/glm.html
I am wondering whether there is any method available to obtain an estimate of the error variance for the fitted model.
Any pointers would be very helpful.
The results instance has a scale attribute. For families whose scale is not fixed at 1 (unlike Poisson and Binomial, where it is fixed), it is computed by default from the Pearson residuals.
If the family is Gaussian, then scale is an estimate of the residual variance. For non-Gaussian families the variance is not constant, and scale is the dispersion parameter.
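A minimal sketch of what that looks like in practice with a Gaussian family (simulated data, purely for illustration):

import numpy as np
import statsmodels.api as sm

# Simulated Gaussian data, purely for illustration
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)

# For the Gaussian family, results.scale estimates the residual variance
results = sm.GLM(y, X, family=sm.families.Gaussian()).fit()
print(results.scale)  # close to 0.25 (= 0.5 ** 2)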
I am confused by this example here: https://scikit-learn.org/stable/visualizations.html
If we plot the ROC curve for a logistic regression classifier, the curve is parametrized by the decision threshold. But a plain SVM outputs binary labels rather than probabilities.
Consequently, there should be no threshold that can be varied to obtain an ROC curve.
But which parameter is then varied in the example above?
SVMs do have a measure of confidence in their predictions: the signed distance from the separating hyperplane (measured in the kernel-induced feature space if you're not using a linear SVM). These values are obviously not probabilities, but they do rank-order the data points, and that is enough to get an ROC curve. In sklearn this is exposed via the decision_function method. (You can also set probability=True on the SVC to calibrate the decision-function values into probability estimates.)
See this section of the User Guide for some of the details on the decision function.
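A minimal sketch of this idea, drawing the ROC curve from decision_function scores (synthetic data, purely for illustration):

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# decision_function returns a signed distance to the hyperplane, not a probability,
# but sweeping a threshold over these scores traces out the ROC curve
clf = SVC(kernel="rbf").fit(X_train, y_train)
scores = clf.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()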
I am modelling the correlation between two seasonal time series using a SARIMAX model (statsmodels.tsa.statespace.sarimax.SARIMAX). The endogenous variable is y(t) and the exogenous variable is x(t). My objective is to forecast y(t) using x(t) as a predictor.
My understanding of the SARIMAX(p,d,q,r)×(P,D,Q)s process is that it models y(t) as a linear function of the r exogenous regressors, with error terms that follow a SARIMA process:
y(t) = c + β1·x(1,t) + β2·x(2,t) + ⋯ + βr·x(r,t) + ε(t)
(see the complete equation in the link: SARIMAX model).
For the problem above I have two questions: (1) Is it possible to model a regression lag order "r" higher than 1?
(2) Once the SARIMAX model has been estimated, what is the best way to forecast future values of y(t) as a function of x(t), given that x(t) is known in the future? I've seen many examples of univariate SARIMA forecasting, but not of SARIMAX forecasting with exogenous regressors.
Many thanks.
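Regarding (2), here is a minimal sketch of how forecasting with known future exogenous values can be set up in statsmodels (the series y, x and x_future are simulated placeholders, and the orders are chosen arbitrarily, purely for illustration):

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Simulated placeholder data: y is the endogenous series, x the in-sample
# exogenous series, x_future its (known) future values
rng = np.random.default_rng(0)
x = rng.normal(size=120)
y = 2.0 * x + rng.normal(size=120)
x_future = rng.normal(size=12)

# Seasonal ARIMA with one exogenous regressor
model = SARIMAX(y, exog=x, order=(1, 0, 1), seasonal_order=(1, 0, 0, 12))
results = model.fit(disp=False)

# Forecast 12 steps ahead, supplying the known future exogenous values
forecast = results.get_forecast(steps=12, exog=x_future.reshape(-1, 1))
print(forecast.predicted_mean)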
I was trying to plot the ROC curve and the precision-recall curve. The points are generated from Spark MLlib's BinaryClassificationMetrics, following the Spark guide at https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
Precision: [(1.0,1.0), (0.0,0.4444444444444444)]
Recall: [(1.0,1.0), (0.0,1.0)]
F1-measure: [(1.0,1.0), (0.0,0.6153846153846153)]
Precision-recall curve: [(0.0,1.0), (1.0,1.0), (1.0,0.4444444444444444)]
ROC curve: [(0.0,0.0), (0.0,1.0), (1.0,1.0), (1.0,1.0)]
It looks like you have a problem similar to the one I ran into. You need to either flip the order of the arguments to the Metrics constructor (it expects (score, label) pairs), or perhaps pass in the probability instead of the prediction. So, for example, if you are using BinaryClassificationMetrics with a RandomForestClassifier, then according to this page (under Outputs) there are "prediction" and "probability" columns.
Then initialize your Metrics thus:
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.functions.col
val metrics = new BinaryClassificationMetrics(predictionsWithResponse
  .select(col("probability"), col("myLabel"))                    // score vector and true label
  .rdd.map(r => (r.getAs[DenseVector](0)(1), r.getDouble(1))))   // (P(class = 1), label)
The DenseVector indexing is what extracts the probability of the 1 class.
As for the actual plotting, that's up to you (there are many fine tools for that), but at least you will get more than one point on your curve (besides the endpoints).
And in case it's not clear:
metrics.roc().collect() will give you the data for the ROC curve: Tuples of: (false positive rate, true positive rate).
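For example, a minimal plotting sketch in Python, assuming roc_points already holds the collected (FPR, TPR) tuples (the values below are placeholders):

import matplotlib.pyplot as plt

# roc_points: list of (false positive rate, true positive rate) tuples,
# e.g. the collected output of metrics.roc()
roc_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

fpr, tpr = zip(*sorted(roc_points))
plt.plot(fpr, tpr, marker="o")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()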
One option of the SVM classifier (SVC) is probability, which is False by default. The documentation does not say exactly what it does; looking at the libsvm source code, it seems to do some sort of cross-validation.
This option does not exist for LinearSVC or OneClassSVM.
I need to calculate AUC scores for several SVM models, including these last two. Should I calculate the AUC score using the decision_function(X) values as the scores?
Answering my own question.
Firstly, it is a common "myth" that you need probabilities to draw the ROC curve. No, what you need is some kind of score in your model whose threshold you can vary; the ROC curve is then drawn by sweeping that threshold. The point of the ROC curve is, of course, to see how well your model reproduces the hypothesis by seeing how well it orders the observations.
In the case of SVMs, I see people draw ROC curves in two ways:
(1) using the distance to the decision boundary, as I mentioned in my own question;
(2) using the bias term as the threshold in the SVM: http://researchgate.net/post/How_can_I_plot_determine_ROC_AUC_for_SVM. In fact, if you use SVC(probability=True), probabilities will be calculated for you via cross-validation, and you can then use them to draw the ROC curve. But as mentioned in the link above, it is much faster to draw the ROC curve directly by varying the bias.
I think #2 is the same as #1 if we are using a linear kernel, as in my own case, because varying the bias is varying the distance threshold in this particular case.
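As an illustration of approach (1), a minimal sketch of computing AUC directly from decision_function scores with sklearn (synthetic data, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic binary classification data, purely for illustration
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AUC depends only on the ranking of the scores, so the (non-probabilistic)
# decision_function values can be passed straight to roc_auc_score
clf = LinearSVC().fit(X_train, y_train)
scores = clf.decision_function(X_test)
print(roc_auc_score(y_test, scores))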
In order to calculate AUC using sklearn, you need a predict_proba method on your classifier; this is what the probability parameter on SVC enables (you are correct that it is calculated using cross-validation). From the docs:
probability : boolean, optional (default=False)
Whether to enable probability estimates. This must be enabled prior to calling fit, and will slow down that method.
You can't use the decision function directly to compute AUC, since it is not a probability. I suppose you could rescale the decision function to take values in the range [0, 1] and compute AUC; however, I'm not sure what statistical properties this would have, and you certainly won't be able to compare it with an ROC calculated from probabilities.