Calculating confidence interval and sample size for data conversions - statistics

Say we are converting data from one bookstore application to another. How would one go about calculating the sample size of books to review after the data conversion to be sure that 90±5% of all books converted correctly?
Say our existing book list contains 30 books. How many books would we have to review in the new application after the data conversion to be 85-95% certain that all books converted correctly?

OK, let's assume the variable X = proportion of books converted correctly, approximately normally distributed, with values between 0 and 1.
Sample size = this is what we want to determine
Population size = 30
Existing book list contains 30 books
Estimated value = 0.90
That is, the value of X that you think is real.
90±5% of all books converted correctly
If you have no idea what the actual value is, use 0.5 instead.
Error margin = 0.05
The difference between the real value and the estimated value. As you ascertained above, this would be ±5%.
Confidence level = 0.95
This is NOT the same as the error margin. You are making a prediction; how sure do you want to be of that prediction? This is the confidence level. You gave two values above:
to be 85-95% certain that all books converted correctly
So we're going with 95%, just to be sure.
The recommended sample size is 25
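For reference, the usual sample-size formula for a population proportion with a finite population correction reproduces that number; a quick sketch in Python:

import math

z = 1.96   # two-sided z-score for 95% confidence
p = 0.90   # estimated value (proportion converted correctly)
e = 0.05   # error margin
N = 30     # population size (the book list)

# Unadjusted sample size for a proportion, then the
# finite population correction for the 30-book list.
n0 = z**2 * p * (1 - p) / e**2   # ~138.3
n = n0 / (1 + (n0 - 1) / N)      # ~24.8
print(math.ceil(n))              # 25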
You can use this calculator to arrive at the same result:
https://select-statistics.co.uk/calculators/sample-size-calculator-population-proportion/
And it also has a magnificent explanation of all the input values above.
Hope it works for you. Cheers!

Related

Calculate risk using Cox model coefficients and mean values

I'm trying to understand the example presented in Appendix C here
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6481149/
Equation C1 is clear to me.
But in Equation C2 they use the mean values.
Such mean values are clear to me in the case of categorical variables; for example, 1.548 is the mean value of the Sex variable (as shown in Table 3). Please correct me if I'm wrong.
But for numerical variables I don't understand which mean values they are using. For example, for the Age variable they use 3.768; if I understand correctly, that should be the log of the mean age, i.e. log(44.15) = 1.64. Instead, the value used is 3.768.
Could anybody please clarify where this value comes from?
In statistics log often means the natural logarithm, sometimes denoted ln. The four values they take the logarithms of are:
Variable      Reported Mean   ln(Mean)   Reported Value
Age           44.15           3.788      3.768
BMI           25.61           3.243      3.230
BP Syst       138.6           4.932      4.913
Pulse Rate    75.61           4.326      4.311
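If you want to reproduce the ln(Mean) column yourself, a quick check in Python:

import math

means = {"Age": 44.15, "BMI": 25.61, "BP Syst": 138.6, "Pulse Rate": 75.61}
for name, mean in means.items():
    print(f"{name}: ln({mean}) = {math.log(mean):.3f}")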
The calculated values are not exactly equal to the reported values. But it looks close enough that this is probably the calculation they used. Without the data and/or code they used it's hard to say why the results are different. The study mentions excluding 130 participants because of ethics protections. So, perhaps one table was calculated using a slightly different group of participants than the other table?

Small data anomaly detection algo

I have the following 3 cases of a numeric metric on a time series (t, t1, t2, etc. denote different hourly comparisons across periods).
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1 but not so much in images 2 and 3. Assume this is some sort of numeric metric (raw or derived); I want to create a system/algorithm that specifically catches case 1 but not cases 2 or 3, with t being the point of interest. While this makes sense visually and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally the problem is how do I detect when the time series is behaving very differently from any of the prior weeks.
Edit: When I say different, what I really mean is that my metric trends together across periods t1 to t4; if it breaks out of that envelope, that to me is an anomaly. If you look at chart 1, you can see t trying to split away from the rest of the tn; that is an anomaly for me. In the other cases t stays within the bounds of the other time periods. Hope this helps.
With small data, the best approach is to come up with a good transformation into a simpler representation.
In this case I would try the following (a rough sketch is given after this list):
Distance to the median along the time axis, then a summary of that distance (median, mean squared error, etc.)
Median of the cross-correlation of the signals
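As a rough sketch of the first idea, assuming a dataframe with one column per period as in your screenshots (the example values and the threshold below are made up):

import numpy as np
import pandas as pd

# Hypothetical data: one row per hour, one column per period,
# standing in for the dataframes shown in the screenshots.
df = pd.DataFrame({
    "t":  [10, 11, 9, 4, 3],     # period of interest: drops off at the end
    "t1": [10, 10, 9, 10, 11],
    "t2": [11, 12, 10, 9, 10],
    "t3": [9, 10, 11, 10, 9],
    "t4": [10, 11, 10, 11, 10],
})

prior = df[["t1", "t2", "t3", "t4"]]
envelope = prior.median(axis=1)            # typical value per hour
spread = prior.std(axis=1).clip(lower=1)   # crude guard against zero spread

# Distance of t from the median of the prior periods, summarised
# over the window with a mean squared error.
distance = (df["t"] - envelope) / spread
score = (distance ** 2).mean()

threshold = 2.0  # tuning knob, made up for illustration
print("anomaly" if score > threshold else "normal", round(float(score), 2))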

Creating independent variable Stata in Panel Data model

Data and description of variables
Picture 1 and Sample unbalanced paneldata
Picture 1 shows a balanced panel data set that I created from the unbalanced one provided as a sample in the same image, where I had multiple products (ID) observed for different numbers of years (YEAR). For each product, a different number of shops offered the given product (ID). As stated, this is a balanced set created by keeping only the same years, same products (ID), and same shops (marked by the orange area in the sample unbalanced panel data). This is an important assumption that might affect the perception of the issue stated below. The following is therefore a description of the table shown in Picture 1:
Years indicates the number of periods a given product (ID) lasts
Shop 1, Shop 2, Shop 3 indicate different prices for a given product (ID) by different firms
The minimum and second minimum value depict which shops, for a given year and product (ID), have the lowest and second lowest price for that year. This is needed to calculate the Price difference, which is (Second minimum value - Minimum value) / Minimum value.
An example is given in row 5 (Year 01.01.1995, ID 101), where the Price difference would be (3999 - 3790) / 3790 = 5.51% (in Picture 1).
Issue
In my balanced panel data (Picture 1), I want to run a fixed effect regression in Stata using the xtreg function, where the dependent variable is the Price difference and the number of shops selling a product supplies the independent variables. This is so I can say how the Price difference is affected when one shop is selling, when two shops are selling, and when three shops are selling.
Another problem: is my assumption of creating a balanced panel valid at all? Is it correct to create a balanced panel from the unbalanced panel data, or must I use the unbalanced panel to create such a variable?
So my main issue is how to create independent variables that measure the number of shops offering products. To clarify what I mean, I have included an example of a fixed effect regression that may explain the structure I am trying to achieve, in Picture 2 below:
NOTE: In Picture 2, the expected cell mean on the right is the same as the Price difference in Picture 1 and is used as the dependent variable. It is regressed on the number of firms/shops as independent variables, and these are what I have an issue creating.
Picture 2
What I have tried
I have tried using dummy variables for shops, but they ended up getting dropped. The data set provided in Picture 1 is balanced, as mentioned, which (I assume) is needed to run a fixed effect regression on panel data.
End remark
I stated this question earlier in a much more imprecise manner, for which I apologise for any inconvenience. The problem, I think, might be that I have set something up wrong in Excel, hence the dummies are dropped, or something of that nature. It might also be that I have to use the unbalanced set in order to create this independent variable, so the problem may be that I am attempting to use a balanced set instead of the unbalanced one.
In your unbalanced sample (as we discussed in the comments, the balanced sample will not make sense) we first need to create a variable for the number of shops offering each ID. Let us say we have the same data as in the top portion of your Picture 1:
egen number_of_firms = rownonmiss(Shop*) // count shops with a non-missing price
xtset ID year // to use xtreg, we must tell Stata the data are panel
xtreg Price_difference i.number_of_firms, fe // fe for the fixed effects regression you describe
The xtreg is the regression shown in your Picture 2.
If you want the number of firms variable to be formatted a bit more like Picture 2, you can do something like this:
qui levelsof number_of_firms, local(num)
foreach n in `num' {
    local lab_def `lab_def' `n' "`n' Firms"
}
label def num_firms `lab_def'
label values number_of_firms num_firms
label var number_of_firms "Number of Firms"
And then run the regression, and the output will be formatted with the number of firms labels.

How to generate random numbers within a normal distribution using Excel

I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to go about achieving this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, and are there any other pointers I can Google?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (one column per new month). Each column has 55,000 rows. While the meter readings need to vary for each month, when taken as a time series each meter number should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e. flag outlying meter readings)
I don't think there is a formula that fits your requirements exactly. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of the generated data by modifying the parameters; for example, =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after the decimal point.
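If it is ever useful to generate such a column outside Excel, here is a sketch of the same piecewise idea in Python; the band above 0.5 is an assumption, since the question only specifies the first bands and "etc.":

import numpy as np

rng = np.random.default_rng()
n = 55_000  # rows per column, as in the use case

# 80% in [0, 0.2], 10% in [0.2, 0.3], 5% in [0.3, 0.5],
# and (assumed) the remaining 5% in [0.5, 1].
bands = [(0.0, 0.2), (0.2, 0.3), (0.3, 0.5), (0.5, 1.0)]
weights = [0.80, 0.10, 0.05, 0.05]

# Pick a band per value, then draw uniformly within it.
choice = rng.choice(len(bands), size=n, p=weights)
lo = np.array([b[0] for b in bands])[choice]
hi = np.array([b[1] for b in bands])[choice]
values = rng.uniform(lo, hi)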
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is the mean and 5 is the standard deviation.
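A rough Python equivalent, e.g. for generating the 55,000-row columns in bulk:

import numpy as np

rng = np.random.default_rng()
# Mirrors =NORMINV(RAND(), 20, 5): mean 20, standard deviation 5.
readings = rng.normal(loc=20, scale=5, size=55_000)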

Compute statistical significance with Excel

I have 2 columns and multiple rows of data in Excel. Each column represents an algorithm, and the values in the rows are the results of these algorithms with different parameters. I want to run a statistical significance test on these two algorithms in Excel. Can anyone suggest a function?
As a result, it would be nice to state something like "Algorithm A performs 8% better than Algorithm B with 0.9 probability (or a 95% confidence interval)".
The wikipedia article explains accurately what I need:
http://en.wikipedia.org/wiki/Statistical_significance
It seems like a very easy task but I failed to find a scientific measurement function.
Any advice about a built-in Excel function, or function snippets, would be appreciated.
Thanks..
Edit:
After tharkun's comments, I realized I should clarify some points:
The results are merely real numbers between 1 and 100 (they are percentage values). As each row represents a different parameter, the values in a row represent the algorithms' results for that parameter. The results do not depend on each other.
When I take the average of all values for Algorithm A and for Algorithm B, I see that the mean of Algorithm A's results is 10% higher than Algorithm B's. But I don't know whether this is statistically significant or not. In other words, maybe for one parameter Algorithm A scored 100 percent higher than Algorithm B while Algorithm B has higher scores for the rest, and the 10% difference in averages is due to that one result alone.
And I want to do this calculation using just Excel.
Thanks for the clarification. In that case you want to do an independent-samples t-test, meaning you want to compare the means of two independent data sets.
Excel has a function, TTEST; that's what you need.
For your example you should probably use two tails and type 2.
The formula outputs a probability known as the probability of an alpha error. This is the error you would make if you concluded the two data sets are different when they actually aren't. The lower the alpha error probability, the higher the chance your sets are truly different.
You should only accept that the two data sets differ if this value is lower than 0.01 (1%), or for critical outcomes even 0.001 or lower. You should also know that the t-test needs at least around 30 values per data set to be reliable enough, and that the type 2 test assumes equal variances for the two data sets. If equal variances are not given, you should use the type 3 test.
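If you want to cross-check Excel's TTEST result outside Excel, the same tests are available in Python's scipy (the sample values below are made up):

from scipy import stats

# Two hypothetical result columns (% scores), one value per parameter.
algo_a = [72.1, 65.4, 80.3, 77.8, 69.9, 74.2]
algo_b = [63.0, 61.2, 70.5, 68.4, 60.1, 66.7]

# Equivalent of TTEST(A, B, 2, 2): two-tailed, equal variances assumed.
t_stat, p_value = stats.ttest_ind(algo_a, algo_b, equal_var=True)

# Equivalent of TTEST(A, B, 2, 3): Welch's test for unequal variances.
t_welch, p_welch = stats.ttest_ind(algo_a, algo_b, equal_var=False)

print(p_value, p_welch)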
http://depts.alverno.edu/nsmt/stats.htm
