Descriptive statistics, percentiles - excel

I am stuck in a statistics assignment, and would really appreciate some qualified help.
We have been given a data set and are then asked to find the 10% with the lowest rate of profit, in order to decide what Profit rate is the maximum in order to be considered for a program.
the data has:
Mean = 3,61
St. dev. = 8,38
I am thinking that i need to find the 10th percentile, and if i run the percentile function in excel it returns -4,71.
However I tried to run the numbers by hand using the z-score.
where z = -1,28
z=(x-μ)/σ
Solving for x
x= μ + z σ
x=3,61+(-1,28*8,38)=-7,116
My question is which of the two methods is the right one? if any at all.
I am thoroughly confused at this point, hope someone has the time to help.
Thank you
This is the assignment btw:
"The Danish government introduces a program for economic growth and will
help the 10 percent of the rms with the lowest rate of prot. What rate
of prot is the maximum in order to be considered for the program given
the mean and standard deviation found above and assuming that the data
is normally distributed?"

The excel formula is giving the actual, empirical 10th percentile value of your sample
If the data you have includes all possible instances of whatever you’re trying to measure, then go ahead and use that.
If you’re sampling from a population and your sample size is small, use a t distribution or increase your sample size. If your sample size is healthy and your data are normally distributed, use z scores.
Short story is the different outcomes suggest the data you’ve supplied are not normally distributed.

Related

sample size for a single arm study based on median time to event

In my master thesis, I need to determine and calculate the number of cases for median time to event. The method is according to Brookmeyer & Crowley, 1982. My question is: How can I determine the sample size according to Brookmeyer? So determine the number of cases for median time to event. How can I define the equation for N? I know how to calculate the confidence interval, but my problem, how do I determine the case number theoretically for this.
Edit:
"Designing the trial with different characteristics: planning a single arm study without historical control. How can I determine the sample size N and what method is the best", this is my plan. Assuming "Median Time to event "PFS" ". I want to determine the sample size N and then calculate it, that's why I thought that I can clearly use or find a formula for N. I firmly assume that the survival time is exponentially distributed I want to see with it: 1- Sample size based on distributional assumptions? 2- No implementation available? How to derive p-value? Thanks for further help, best regards

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulates on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on confidence level and confidence interval. Assuming a p value of .5 to maximum sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that confidence interval is calculated using a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because this seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
From this formula you would also get your formula, where you are estimating n.
If the population SD is not known then you need to replace the z-value with a t-value.

Verify transmit power to be within certain limits of its expected value over 95% of test measurements

I have a requirement where I have to verify the transmit power out of a device as measured at its connector is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max and stdDev of the measurements and not the individual power measurements.
Now, the question is how would I verify the "95% thing" using average power, min, max and stdDev. It seems that I can use normal distribution to find the 95% confidence level.
I would appreciate if someone can help me on this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the 'central limit theorem', and I recommend both the Khan academy's presentation of it as well as the wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, like for instance n= 5, or n= 30, then assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975,n-1)*SD/sqrt(n) below the average to qt(0.975,n-1)*SD/sqrt(N) above the average, where qt(0.975,n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.

Integrating Power pdf to get energy pdf?

I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method.
I have time-series data that represents the pdf of a Power output (P), varying over time, also the cdf and quantile functions - f(P,t), F(P,t) and q(p,t). I need to find the pdf, cdf and quantile function for the Energy in a given time interval [t1,t2] from this data - say e(), E(), and qe().
Clearly energy is the integral of the power over [t1,t2], but how do I best calculate e, E and qe ?
My best guess is that since q(p,t) is a power, I should generate qe by integrating q over the time interval, and then calculate the other distributions from that.
Is it as simple as that, or do I need to get to grips with stochastic calculus ?
Additional details for clarification
The data we're getting is a time-series of 'black-box' forecasts for f(P), F(P),q(P) for each time t, where P is the instantaneous power and there will be around 100 forecasts for the interval I'd like to get the e(P) for. By 'Black-box' I mean that there will be a function I can call to evaluate f,F,q for P, but I don't know the underlying distribution.
The black-box functions are almost certainly interpolating output data from the model that produces the power forecasts, but we don't have access to that. I would guess that it won't be anything straightforward, since it comes from a chain of non-linear transformations. It's actually wind farm production forecasts: the wind speeds may be normally distributed, but multiple terrain and turbine transformations will change that.
Further clarification
(I've edited the original text to remove confusing variable names in the energy distribution functions.)
The forecasts will be provided as follows:
The interval [t1,t2] that we need e, E and qe for is sub-divided into 100 (say) sub-intervals k=1...100. For each k we are given a distinct f(P), call them f_k(P). We need to calculate the energy distributions for the interval from this set of f_k(P).
Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps are long enough, power might be approximately independent from one step to the next, which would be good news because that would simplify the analysis quite a bit. So, how long are the time steps? An hour? A minute? A day?
If the time steps are long enough to be independent, the distribution of energy is the distribution of 100 variables, which will be very nearly normally distributed by the central limit theorem. It's easy to work out the mean and variance of the total energy in this case.
Otherwise, the distribution will be some more complicated result. My guess is that the variance as estimated by the independent-steps approach will be too big -- the actual variance would be somewhat less, I believe.
From what you say, you don't have any information about temporal dependence. Maybe you can find or derive from some other source or sources an estimate the autocorrelation function -- I wouldn't be surprised if that question has already been studied for wind power. I also wouldn't be surprised if a general version of this problem has already been studied -- perhaps you can search for something like "distribution of a sum of autocorrelated variables." You might get some interest in that question on stats.stackexchange.com.

Obtaining the Standard Error of Weighted Data in SPSS

I'm trying to find confidence intervals for the means of various variables in a database using SPSS, and I've run into a spot of trouble.
The data is weighted, because each of the people who was surveyed represents a different portion of the overall population. For example, one young man in our sample might represent 28000 young men in the general population. The problem is that SPSS seems to think that the young man's database entries each represent 28000 measurements when they actually just represent one, and this makes SPSS think we have much more data than we actually do. As a result SPSS is giving very very low standard error estimates and very very narrow confidence intervals.
I've tried fixing this by dividing every weight value by the mean weight. This gives plausible figures and an average weight of 1, but I'm not sure the resulting numbers are actually correct.
Is my approach sound? If not, what should I try?
I've been using the Explore command to find mean and standard error (among other things), in case it matters.
You do need to scale weights to the actual sample size, but only the procedures in the Complex Samples option are designed to account for sampling weights properly. The regular weight variable in Statistics is treated as a frequency weight.

Resources