pROC test difference from AUC = 0.5 - proc-r-package

I would like to know how can I use pROC package to test whether a given AUC value is statistically significantly different from an AUC = 0.5. I can get confidence intervals but I would also like to report the associated p-values.
Thank you in advance for you time and help!

You can't do that directly with pROC, but you don't need to. Once you realize that asking whether the curve = 0.5 is actually equivalent to asking whether the median values of your two classes are equal, you can easily perform a Wilcoxon Rank Sum Tests:
wilcox.test(cases, controls)

Related

How can MAPE be calculated if some of the actuals in the dataset are 0 values?

I am new to datascience and trying to understand difference evaluation in forecast vs actuals.
Lets say I have actuals:
27.580
25.950
0.000 (Sum = 53.53)
And my predicted values using XGboost are:
29.9
25.4
15.0 (Sum = 70.3)
Is it better to just evaluate based on the sum? example add all actuals minus all predicted? difference = 70.3 - 53.53?
Or is it better to evaluate the difference based on forecasting error techniques like MSE,MAE,RMSE,MAPE?
Since, I read MAPE is most widely accepted, how can it be implemented in cases where 0 is the denominator as can be seen in my actuals above?
Is there a better way to evaluate deviation from actuals or are these the only legitimate methods? My objective is to build more predictive models involving different variables which will give me different predicted values and then choose the one which has the least deviation from the actuals.
If you are to evaluate based on each on each point or the sum, depends on your data and your use case.
For example, if each point represents a time bucket, and the accuracy of each time bucket is important (for example for a production plan), then I would say it is required to evaluate for each bucket.
If you are to measure the accuracy of the sum, then you might as well also forecast based on the sum.
For your question on MAPE then there is no way around the issue you mention here. Your data need to be non-zero for MAPE to be valuable. If you are only to assess one time series then you can use the MAE instead, and then you do not have the issue of the accuracy being infinite/undefined.
But, there are many ways to measure accuracy, and my experience is that it very much depends on your use case and your data set which one that are preferable. See Hyndman's article on accuracy for intermittent demand, for some good points on accuracy measures.
I use MdAPE (Median Absolute Percentage Error) whenever MAPE is not possible to calculate due to 0s

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulates on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on confidence level and confidence interval. Assuming a p value of .5 to maximum sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that confidence interval is calculated using a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because this seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
From this formula you would also get your formula, where you are estimating n.
If the population SD is not known then you need to replace the z-value with a t-value.

standard error of addition, subtraction, multiplication and ratio

Let's say, I have two random variables,x and y, both of them have n observations. I've used a forecasting method to estimate xn+1 and yn+1, and I also got the standard error for both xn+1 and yn+1. So my question is that what the formula would be if I want to know the standard error of xn+1 + yn+1, xn+1 - yn+1, (xn+1)*(yn+1) and (xn+1)/(yn+1), so that I can calculate the prediction interval for the 4 combinations. Any thought would be much appreciated. Thanks.
Well, the general topic you need to look at is called "change of variables" in mathematical statistics.
The density function for a sum of random variables is the convolution of the individual densities (but only if the variables are independent). Likewise for the difference. In special cases, that convolution is easy to find. For example, for Gaussian variables the density of the sum is also a Gaussian.
For product and quotient, there aren't any simple results, except in special cases. For those, you might as well compute the result directly, maybe by sampling or other numerical methods.
If your variables x and y are not independent, that complicates the situation. But even then, I think sampling is straightforward.

How to identify data points that are significantly smaller than the others in a data set?

I have an array of data points of real value. I wish to identify those data points whose values are significantly smaller than others. Are there any well-known algorithms?
For example, the data set can be {0.01, 0.32, 0.45, 0.68, 0.87, 0.95, 1.0}. I can manually tell that 0.01 is significantly smaller than the others. However, I would like to know are there any analysis method for this purpose in statistics area? I tried outlier detection in my data set, but it cannot find any outliers (such as detecting 0.01 as outlier).
I have deleted a segment I wrote explaining the use of zscores for your problem but it was incorrect, I hope the information below is accurate, just in case, use it as a guide only...
The idea is to build a z-distribution from the scores you are testing, minus the test score, and then use that distribution to get a zscore of the test score. Any z greater than 1.96 is unlikely to belong to your test population.
I am not that this works properly because you remove your tests score' influence from the distribution, thus large scores will have inflated zscores because they contribute to a greater variance (the denominator in the zscore equation).
This could be a start till someone with a modicum of expertise sets us straight :)
e.g.
for i = 1:length(data_set)
test_score = data_set(i)
sample_pop = data_set(data_set~=test_score)
sample_mean = mean(sample_pop)
sample_stdev = std(sample_pop)
test_z(i) = (i-sample_mean)/sample_stdev
end
This can be done for higher dimensions by using the dim input for mean.

How do i prove that my derived equation and the Monte-Carlo simulation are equivalent?

I have derived and implemented an equation of an expected value.
To show that my code is free of errors i have employed the Monte-Carlo
computation a number of times to show that it converges into the same
value as the equation that i derived.
As I have the data now, how can i visualize this?
Is this even the correct test to do?
Can I give a measure how sure i am that the results are correct?
It's not clear what you mean by visualising the data, but here are some ideas.
If your Monte Carlo simulation is correct, then the Monte Carlo estimator for your quantity is just the mean of the samples. The variance of your estimator (how far away from the 'correct' value the average value will be) will scale inversely proportional to the number of samples you take: so long as you take enough, you'll get arbitrarily close to the correct answer. So, use a moderate (1000 should suffice if it's univariate) number of samples, and look at the average. If this doesn't agree with your theoretical expectation, then you have an error somewhere, in one of your estimates.
You can also use a histogram of your samples, again if they're one-dimensional. The distribution of samples in the histogram should match the theoretical distribution you're taking the expectation of.
If you know the variance in the same way as you know the expectation, you can also look at the sample variance (the mean squared difference between the sample and the expectation), and check that this matches as well.
EDIT: to put something more 'formal' in the answer!
if M(x) is your Monte Carlo estimator for E[X], then as n -> inf, abs(M(x) - E[X]) -> 0. The variance of M(x) is inversely proportional to n, but exactly what it is will depend on what M is an estimator for. You could construct a specific test for this based on the mean and variance of your samples to see that what you've done makes sense. Every 100 iterations, you could compute the mean of your samples, and take the difference between this and your theoretical E[X]. If this decreases, you're probably error free. If not, you have issues either in your theoretical estimate or your Monte Carlo estimator.
Why not just do a simple t-test? From your theoretical equation, you have the true mean mu_0 and your simulators mean,mu_1. Note that we can't calculate mu_1, we can only estimate it using the mean/average. So our hypotheses are:
H_0: mu_0 = mu_1 and H_1: mu_0 does not equal mu_1
The test statistic is the usual one-sample test statistic, i.e.
T = (mu_0 - x)/(s/sqrt(n))
where
mu_0 is the value from your equation
x is the average from your simulator
s is the standard deviation
n is the number of values used to calculate the mean.
In your case, n is going to be large, so this is equivalent to a Normal test. We reject H_0 when T is bigger/smaller than (-3, 3). This would be equivalent to a p-value < 0.01.
A couple of comments:
You can't "prove" that the means are equal.
You mentioned that you want to test a number of values. One possible solution is to implement a Bonferroni type correction. Basically, you reduce your p-value to: p-value/N where N is the number of tests you are running.
Make your sample size as large as possible. Since we don't have any idea about the variability in your Monte Carlo simulation it's impossible to say use n=....
The value of p-value < 0.01 when T is bigger/smaller than (-3, 3) just comes from the Normal distribution.

Resources