I made a huge mistake. I printed the accuracy evaluation of my scikit-learn SVM only as:
str(metrics.classification_report(trainExpected, trainPredict, digits=6))
Now I need to calculate the accuracy from the following output:
             precision    recall  f1-score  support
          1   0.000000  0.000000  0.000000     1259
          2   0.500397  1.000000  0.667019     1261
avg / total   0.250397  0.500397  0.333774     2520
Is it possible to calculate accuracy from these values?
PS: I don't want to spend another day getting outputs from the model. I just realized this mistake and hope I don't need to start from the beginning.
You can compute the accuracy from precision, recall and the counts of true/false positives, or, in your case, from the support (even when precision or recall is 0 due to a zero numerator or denominator).
TruePositive+FalseNegative=Support_True
TrueNegative+FalsePositive=Support_False
Precision=TruePositive/(TruePositive+FalsePositive) if TruePositive+FalsePositive!=0 else 0
Recall=TruePositive/(TruePositive+FalseNegative) if TruePositive+FalseNegative!=0 else 0
Accuracy=(TruePositive+TrueNegative)/(TruePositive+TrueNegative+FalsePositive+FalseNegative)
-or-
Given the TruePositive/TrueNegative counts, for example:
TPP = TruePositive/Precision = TruePositive+FalsePositive (if Precision != 0 and TruePositive != 0, else TPP = 0)
TPR = TruePositive/Recall = TruePositive+FalseNegative (if Recall != 0 and TruePositive != 0, else TPR = 0)
In the above, when TruePositive == 0, no computation is possible without more information about FalseNegative/FalsePositive, which is why using the support is better.
Accuracy=(TruePositive+TrueNegative)/(TPP+TPR-TruePositive+TrueNegative)
But in your case the support was given, so we use recall:
Recall=TruePositive/Support_True if Support_True!=0 else 0
TruePositive=Recall*Support_True and, likewise, TrueNegative=Recall_False*Support_False (these hold in all cases)
Accuracy=(Recall*Support_True+Recall_False*Support_False)/(Support_True + Support_False)
In your case, (0*1259 + 1*1261)/(1259+1261) = 0.500397, which is exactly what you would expect when only one class is predicted: the precision of that class then equals the accuracy.
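For completeness, here is a minimal Python sketch of that calculation, plugging in the recall and support values from the report above (the variable names are only illustrative):

# per-class recall and support read off the classification report
recall_1, support_1 = 0.000000, 1259
recall_2, support_2 = 1.000000, 1261

# true positives per class recovered as recall * support
tp_1 = recall_1 * support_1
tp_2 = recall_2 * support_2

# accuracy = correctly classified / total
accuracy = (tp_1 + tp_2) / (support_1 + support_2)
print(accuracy)  # 0.500397 to six decimals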
As the other poster said, it is better to use the library, but since this also sounded like a mathematical question, the above can be used.
No need to spend more time on it. The metrics module has everything you need and you have already computed the predicted values. It's a one-line change.
print(metrics.accuracy_score(trainExpected, trainPredict))
I suggest that you spend some time to read the linked page to learn more about evaluating models in general.
I do think you have a bigger problem at hand: you have zero predictions for your 1 class, despite having balanced classes. You likely have a problem in your data, modeling strategy, or code that you'll have to deal with.
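If you want to see where those 1-class examples are going, a confusion matrix is a quick diagnostic; a minimal sketch reusing your existing trainExpected/trainPredict arrays:

from sklearn import metrics

# rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(trainExpected, trainPredict))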
Related
I'm trying to do a simple comparison of two samples to determine if their means are different. Regardless of whether their standard deviations are equal/unequal, the formulas for a t-test or z-test are similar.
(I can't post images on a new account, so here are links to the formulas.)
t-value with unequal variances: t = (X1 - X2) / sqrt(S1^2/N1 + S2^2/N2)
https://www.biologyforlife.com/uploads/2/2/3/9/22392738/949234_orig.jpg
t-value with equal/pooled variances: t = (X1 - X2) / (Sp * sqrt(1/N1 + 1/N2)), where Sp^2 = ((N1-1)*S1^2 + (N2-1)*S2^2) / (N1 + N2 - 2)
https://vitalflux.com/wp-content/uploads/2022/01/pooled-t-statistics-300x126.jpg
The issue here is the inverse and square root of the sample sizes in the denominator, which makes large samples produce seemingly massive t-values.
For instance, I have two samples with
sizes: N1 = 168,000 and N2 = 705,000
means: X1 = 89 and X2 = 49
std devs: S1 = 96 and S2 = 66.
At first glance, these standard deviations are larger than the means and suggest non-homogeneous samples with a lot of internal variation. When comparing the two samples, however, the denominator of the t-test (the standard error of the difference) comes out to approximately 0.25, so a 1 unit difference in means corresponds to about 4 standard errors. My t-value here therefore comes out to around 160(!!)
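For reference, here is a minimal Python sketch of that arithmetic (Welch's unequal-variance t computed from the summary statistics above); scipy's ttest_ind_from_stats is used only as a cross-check:

import math
from scipy import stats

n1, x1, s1 = 168_000, 89.0, 96.0
n2, x2, s2 = 705_000, 49.0, 66.0

# standard error of the difference in means (unequal variances)
se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # ~0.247
t = (x1 - x2) / se                        # ~162
print(se, t)

# cross-check with scipy's Welch t-test from summary statistics
t_scipy, p = stats.ttest_ind_from_stats(x1, s1, n1, x2, s2, n2, equal_var=False)
print(t_scipy, p)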
All this to say, I'm just plugging in numbers since I didn't do many of these problems in advanced stats and haven't seen this formula since Stats110.
It makes some sense that the variance of the mean shrinks for massive samples before comparing, but this seems like it may not be the best test for data at this scale.
What other tests are out there that I could try? What is the logic behind this seemingly over-biased variance?
I am new to data science and I am trying to understand how to evaluate forecasts against actuals.
Let's say I have the following actuals:
27.580
25.950
0.000 (Sum = 53.53)
And my predicted values using XGboost are:
29.9
25.4
15.0 (Sum = 70.3)
Is it better to just evaluate based on the sums, for example all predicted minus all actuals: difference = 70.3 - 53.53?
Or is it better to evaluate the difference using forecasting error metrics like MSE, MAE, RMSE, or MAPE?
Since I read that MAPE is the most widely accepted, how can it be implemented in cases where the denominator is 0, as in my actuals above?
Is there a better way to evaluate deviation from actuals, or are these the only legitimate methods? My objective is to build several predictive models involving different variables, which will give me different predicted values, and then choose the one with the least deviation from the actuals.
Whether you should evaluate each point or the sum depends on your data and your use case.
For example, if each point represents a time bucket and the accuracy of each bucket matters (for example for a production plan), then I would say you need to evaluate per bucket.
If you only want to measure the accuracy of the sum, then you might as well forecast the sum directly.
As for MAPE, there is no way around the issue you mention: your actuals need to be non-zero for MAPE to be meaningful. If you only need to assess a single time series, you can use MAE instead, which avoids the accuracy becoming infinite/undefined.
But there are many ways to measure accuracy, and in my experience which one is preferable depends very much on your use case and your data set. See Hyndman's article on accuracy measures for intermittent demand for some good points.
I use MdAPE (Median Absolute Percentage Error) whenever MAPE cannot be calculated because of zeros.
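For concreteness, here is a minimal numpy sketch of these metrics on the three values from the question; note how the zero actual makes MAPE blow up while MAE, RMSE and MdAPE stay finite:

import numpy as np

actual = np.array([27.580, 25.950, 0.000])
predicted = np.array([29.9, 25.4, 15.0])

err = predicted - actual
mae = np.mean(np.abs(err))
rmse = np.sqrt(np.mean(err ** 2))

# absolute percentage errors: the zero actual produces inf
with np.errstate(divide='ignore'):
    ape = np.abs(err) / np.abs(actual)

mape = np.mean(ape) * 100     # inf, because of the zero actual
mdape = np.median(ape) * 100  # finite, the inf is just one of the ranked values

print(mae, rmse, mape, mdape)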
I have many measurements of the age of the same person. Let's say:
[23 25 32 23 25]
I would like to output a single value and a reliability score of this value. The single value can be the average.
I am not sure how to calculate the reliability. It should be a value between 0 and 1, where 1 means all the ages are equal and a very unreliable measurement is close to 0.
The variance should probably be used here, but it's not clear to me how to normalize it to the range [0, 1] in a meaningful way (1/(x+1) is not very meaningful :)).
Assume some probability distribution (or determine what probability distribution your data fits most accurately). A good choice is a normal distribution, which for discrete data requires a continuity correction. See example here: http://www.milefoot.com/math/stat/pdfc-normaldisc.htm
In your example, the reliability score for the average age of 26 (25.6 rounded to the nearest integer) is simply the probability that X falls in the range (25.5, 26.5).
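A minimal scipy sketch of that calculation, assuming a normal distribution fitted to the five measurements and the continuity-corrected interval (25.5, 26.5) from above:

import numpy as np
from scipy import stats

ages = np.array([23, 25, 32, 23, 25])
mu = ages.mean()           # 25.6
sigma = ages.std(ddof=1)   # sample standard deviation, ~3.71

# P(25.5 < X < 26.5) under the fitted normal, used as the reliability score
score = stats.norm.cdf(26.5, mu, sigma) - stats.norm.cdf(25.5, mu, sigma)
print(score)  # ~0.11 for this data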
The easiest way for assessing reliability (or internal consistency) is to use Cronbach's alpha. I guess most statistics software has this method built-in.
https://en.wikipedia.org/wiki/Cronbach%27s_alpha
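If you go that route, here is a minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum(item variances)/variance of total score). Note that it expects a respondents x items matrix (several measurements per subject across multiple subjects), so the single vector above would have to be rearranged accordingly; the example matrix is purely hypothetical:

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# hypothetical example: 4 respondents measured on 3 items
ratings = [[23, 25, 24],
           [30, 32, 31],
           [22, 23, 23],
           [25, 25, 26]]
print(cronbach_alpha(ratings))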
I'm normalizing my data to zero mean and unit variance, as recommended in most of the literature, to pre-train a GB-RBM. But no matter which learning rate I choose and how many epochs I train, my mean reconstruction error never drops below about 0.6.
The reconstruction errors for the stacked BB-RBMs easily drop to 0.01 within a few epochs. I've used several toolkits that implement GB-RBMs as described in http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf, but all show the same issue. Am I missing something, or is the reconstruction error meant to stay above 50%?
I'm normalizing my data by subtracting the mean and dividing by the standard deviation along each dimension of the input vector:
size(mfcc) --> [mlength rows x 39 cols]
mmean=mean(mfcc);
mstd=std(mfcc);
mfcc=mfcc-ones(mlength,1)*mmean;
mfcc=mfcc./(ones(mlength,1)*mstd);
This does give me zero mean and unit variance along each dimension. I have tried different datasets, different features and different toolkits, but my reconstruction error never drops below 0.6 for GB-RBMs.
Thanks
I would guess you are using exp() inside the sigmoid and a third-party library for the matrix operations?
If so, I would guess the third-party library is swallowing the exp() overflow errors but still stopping the calculation, so the hidden/reconstructed vectors are invalid.
edit based on comment below:
theano.tensor.nnet.sigmoid() uses exp(), so I would first try switching to hard_sigmoid(). It won't be as smooth a curve, but it won't overflow/underflow, so you can check whether that is the source of the error.
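To see what that overflow looks like in isolation, here is a minimal numpy sketch comparing the naive logistic sigmoid with a piecewise-linear hard sigmoid (the same clip(0.2*x + 0.5, 0, 1) shape that hard_sigmoid() uses):

import numpy as np

x = np.array([-800.0, 0.0, 800.0])

# naive logistic sigmoid: exp(-x) overflows for large negative x,
# which some libraries silently swallow while producing inf/nan
naive = 1.0 / (1.0 + np.exp(-x))   # RuntimeWarning: overflow for x = -800

# piecewise-linear "hard" sigmoid: no exp(), so no overflow/underflow
hard = np.clip(0.2 * x + 0.5, 0.0, 1.0)

print(naive)  # [0.  0.5 1. ] plus an overflow warning
print(hard)   # [0.  0.5 1. ]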
I assume you tried other data preprocessing and still had the high reconstruction errors?
I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each person who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28,000 young men in the general population. The problem is that SPSS seems to treat each of that young man's database entries as 28,000 measurements when they actually represent just one, which makes SPSS think we have far more data than we actually do. As a result, SPSS is giving very low standard error estimates and very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale the weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in SPSS Statistics is treated as a frequency weight.
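To see the scaling issue numerically, here is a minimal numpy sketch with hypothetical data: treating the raw population weights as frequency counts makes the effective n equal to the sum of the weights, while rescaling the weights to sum to the real sample size gives a far more plausible standard error. This only illustrates the scaling problem, not how Complex Samples computes its estimates:

import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(50, 10, size=200)            # 200 hypothetical survey responses
w = rng.uniform(10_000, 40_000, size=200)   # population weights (each person "represents" tens of thousands)

def se_of_weighted_mean(y, w):
    """SE of the mean, treating w as frequency counts."""
    n_eff = w.sum()
    mean = np.average(y, weights=w)
    var = np.average((y - mean) ** 2, weights=w) * n_eff / (n_eff - 1)
    return np.sqrt(var / n_eff)

print(se_of_weighted_mean(y, w))             # tiny SE: effective n is in the millions
print(se_of_weighted_mean(y, w / w.mean()))  # plausible SE: weights now sum to 200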