How do I calculate a confidence interval with only sample size and confidence level?

I'm writing a program that lets users run simulations on a subset of data, and as part of this process the program allows a user to specify what sample size they want based on a confidence level and confidence interval. Assuming a p value of .5 to maximize the sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get a sample size of 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that the confidence interval is calculated from a given sample size and confidence level (and I know the population size). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like a sensible thing to do? It seems like a weird request to me.

I should mention (just to be clear) that the CI here is estimated for the mean, not for the population as a whole. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
CI = x̄ ± z · SD/√n
Rearranging this formula also gives you the sample-size formula, where you solve for n.
If the population SD is not known, then you need to replace the z-value with a t-value.
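Since the formula on that page is just algebra, it can be run in reverse in closed form. Here is a minimal sketch in Python (the function name is my own; it assumes the same worst-case p = .5 and the same finite-population correction used on that page):

    import math
    from statistics import NormalDist

    def interval_from_sample_size(n, population, confidence=0.95, p=0.5):
        # z for a two-sided interval, e.g. 1.96 at 95% confidence
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        # undo the finite-population correction n = ss / (1 + (ss - 1) / pop)
        ss = n * (population - 1) / (population - n)
        # the forward formula is ss = z**2 * p * (1 - p) / c**2, so solve for c
        c = z * math.sqrt(p * (1 - p) / ss)
        return 100 * c  # confidence interval in percentage points

    print(interval_from_sample_size(150, 54213))  # ~8.0, matching the example above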

Related

Generating multiple sets of n samples from data set where standard deviation of each set is minimized

I prepared a dataset and later learned that it is skewed.
Assume a plot of user_count vs. score, where user_count is the number of users at that particular score.
I have to split the total users into multiple samples of size 100 <= n <= 1000 in such a way that the standard deviation of each created sample is minimized.
How do I do that?
I have tried binning methods like custom binning, quantiles, etc., but they have not helped, because with manual binning some of my bins have a high SD.
Example:
I created 19 custom bins of interval: .05-.10, .10-.15, ......, .90-.95, >.95
this gives me a table of per-bin statistics (table omitted).
The problem here is that the last bin, q19, has a high SD.
So I am trying to figure out a way to automatically create an optimal number of bins with minimal standard deviations.
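One approach worth trying is one-dimensional k-means, which places bin edges so as to minimize within-bin variance (similar in spirit to Jenks natural breaks), rather than using fixed-width bins. A minimal sketch, using synthetic beta-distributed scores in place of the real data; the sample-size constraint (100 <= n <= 1000) still has to be checked against the printed bin sizes:

    import numpy as np
    from sklearn.cluster import KMeans

    # stand-in for the real data: 5000 skewed scores in [0, 1]
    rng = np.random.default_rng(0)
    scores = rng.beta(2, 8, size=5000).reshape(-1, 1)

    for k in (10, 15, 19, 25):  # candidate bin counts
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
        sds = [scores[labels == i].std() for i in range(k)]
        sizes = np.bincount(labels, minlength=k)
        print(f"k={k}: max within-bin SD = {max(sds):.4f}, "
              f"bin sizes {sizes.min()}-{sizes.max()}")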

Sample size for a single-arm study based on median time to event

For my master's thesis, I need to determine the number of cases for a study whose endpoint is the median time to event, using the method of Brookmeyer & Crowley (1982). My question is: how can I determine the sample size according to Brookmeyer, i.e., how do I set up the equation for N? I know how to calculate the confidence interval, but my problem is how to determine the required number of cases from it theoretically.
Edit:
"Designing the trial with different characteristics: planning a single arm study without historical control. How can I determine the sample size N and what method is the best", this is my plan. Assuming "Median Time to event "PFS" ". I want to determine the sample size N and then calculate it, that's why I thought that I can clearly use or find a formula for N. I firmly assume that the survival time is exponentially distributed I want to see with it: 1- Sample size based on distributional assumptions? 2- No implementation available? How to derive p-value? Thanks for further help, best regards

Descriptive statistics, percentiles

I am stuck on a statistics assignment and would really appreciate some qualified help.
We have been given a data set and are asked to find the 10% with the lowest rate of profit, in order to decide what profit rate is the maximum in order to be considered for a program.
The data has:
Mean = 3.61
St. dev. = 8.38
I am thinking that I need to find the 10th percentile, and if I run the PERCENTILE function in Excel it returns -4.71.
However, I tried to run the numbers by hand using the z-score, where z = -1.28:
z = (x − μ)/σ
Solving for x:
x = μ + zσ
x = 3.61 + (−1.28 × 8.38) = −7.12
My question is: which of the two methods is the right one, if either?
I am thoroughly confused at this point; I hope someone has the time to help.
Thank you
This is the assignment, btw:
"The Danish government introduces a program for economic growth and will help the 10 percent of the firms with the lowest rate of profit. What rate of profit is the maximum in order to be considered for the program, given the mean and standard deviation found above and assuming that the data is normally distributed?"
The Excel formula gives the actual, empirical 10th percentile of your sample.
If the data you have include all possible instances of whatever you're trying to measure, then go ahead and use that.
If you're sampling from a population and your sample size is small, use a t-distribution or increase your sample size. If your sample size is healthy and your data are normally distributed, use z-scores.
The short story is that the different outcomes suggest the data you've supplied are not normally distributed.
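To make the comparison concrete, here is a short sketch reproducing both routes (the raw data are not given in the question, so the empirical call is shown as a comment with a hypothetical data array):

    from statistics import NormalDist

    mean, sd = 3.61, 8.38
    z10 = NormalDist().inv_cdf(0.10)  # -1.2816, the exact 10th-percentile z
    print(mean + z10 * sd)            # -7.13: the normal-theory cutoff

    # the empirical route (what Excel's PERCENTILE function does):
    # import numpy as np
    # print(np.percentile(data, 10))  # -4.71 for the assignment's data set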

Verify transmit power to be within certain limits of its expected value over 95% of test measurements

I have a requirement where I have to verify that the transmit power out of a device, as measured at its connector, is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max, and stdDev of the measurements, not the individual power measurements.
Now, the question is how I would verify the "95% thing" using the average power, min, max, and stdDev. It seems that I could use the normal distribution to find the 95% confidence level.
I would appreciate if someone can help me on this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the central limit theorem, and I recommend both Khan Academy's presentation of it and the Wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, like n = 5 or n = 30, then the assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975, n-1)*SD/sqrt(n) below the average to qt(0.975, n-1)*SD/sqrt(n) above the average, where qt(0.975, n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.
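Note that the original requirement is about individual readings (95% of measurements within 2 dB), not about the mean, so the 1.96-standard-deviations point above is the relevant one. A minimal sketch of that check under the normality assumption, with made-up numbers standing in for the analyzer readout:

    from statistics import NormalDist

    def fraction_within(mean, sd, expected, tol=2.0):
        # assuming individual power readings are normal, estimate the
        # fraction falling within +/- tol dB of the expected value;
        # the requirement passes if this is >= 0.95
        dist = NormalDist(mean, sd)
        return dist.cdf(expected + tol) - dist.cdf(expected - tol)

    # hypothetical readout from the signal analyzer
    print(f"{fraction_within(mean=30.1, sd=0.8, expected=30.0):.1%}")  # ~98.7%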

A method to find the inconsistency or variation in the data

I am running an experiment (an image-processing experiment) in which I have a set of paper samples, and each sample has a set of lines. For each line in a paper sample, its strength is calculated, denoted by say 's'. For a given paper sample I have to find the variation among the strength values 's'. If the variation is above a certain limit, we have to discard that paper.
I started with the standard deviation of the values, but the problem I am facing is that the order of magnitude of s can differ from sample to sample (because of properties of the line such as its length, sharpness, darkness, etc.), so the calculated standard deviations also differ a lot in magnitude. So I can't really use this method across different samples.
Is there any way I can find a suitable limit that is applicable to all samples?
I am thinking that since I don't have any history of how the strength values should behave (for a sample where the magnitudes are large, more variation could be tolerated, whereas for a sample where the magnitudes are smaller, less variation should be allowed), I first need to find a way of baselining the variation across different samples. I don't know what approaches I could try to get started.
Please note that I have to measure the variation between lines within a sample, whereas the limit should be applicable to any good sample.
Please help me out.
You seem to have a set of samples, and for each sample you want to do two things: 1) compute a descriptive metric and 2) perform outlier detection. Both of these are vast subjects that require some knowledge of the phenomenology and statistics of the underlying problem. However, below are some ideas to get you going.
Compute a metric
Median Absolute Deviation. If your sample strength s has values that can jump by an order of magnitude across a sample then it is understandable that the standard deviation was not a good metric. The standard deviation is notoriously sensitive to outliers. So, try a more robust estimate of dispersion in your data. For example, the MAD estimate uses the median in the underlying computations which is more robust to a large spread in the numbers.
Robust measures of scale. Read up on other robust measures like the Interquartile range.
Perform outlier detection
Thresholding. This is similar to what you are already doing; however, you have to choose a suitable threshold for the metric computed above, and you might consider using another robust metric for that. You can compute a robust estimate of the center (e.g., the median) and a robust estimate of the standard deviation (e.g., 1.4826 * MAD), then identify outliers as metric values more than some number of robust standard deviations above the robust mean (see the sketch after this list).
Histogram. Another simple method is to histogram the metrics you computed in step #1. This is non-parametric, so it doesn't require you to model your data: histogram your metric values and then use the top 1% (or some other value) as your threshold limit.
Triangle method. A neat and simple heuristic for thresholding is the triangle method, which performs binary classification of a skewed distribution.
Anomaly detection. Read up on other outlier-detection methods.
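As an illustration of the median / 1.4826 * MAD rule from the Thresholding item above, here is a minimal sketch (the strength values are invented):

    import numpy as np

    def robust_outliers(s, n_sds=3.0):
        # robust z-score: center = median, scale = 1.4826 * MAD;
        # because the score is unitless, the same n_sds threshold can be
        # applied to samples whose strengths differ in magnitude
        s = np.asarray(s, dtype=float)
        med = np.median(s)
        robust_sd = 1.4826 * np.median(np.abs(s - med))
        return np.abs(s - med) > n_sds * robust_sd

    # hypothetical strengths for one paper sample, with one wild line
    print(robust_outliers([5.1, 4.8, 5.3, 4.9, 5.0, 48.0]))
    # -> [False False False False False  True]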
