Sample size calculation between 3 groups

I would like to calculate a sample size for 3 groups based on the mean of a continuous variable (flow per min) in each of the three groups.
It's a randomised study with an allocation ratio of 1:1:1. From the literature I know the following means:
Group 1 = 30, group 2 = 39, group 3 = 35
I would like a significance level of 0.05 and a power of 0.80.
So far I have tried the pwr package, and I end up with a sample size of 52 for each group.
But here I just used Cohen's recommended medium effect size, f = 0.25, and I would rather base the calculation on the means from the literature.
All help is very much appreciated.
```
pwr.anova.test(k=3,f=.25,sig.level=.05,power=.8)
```
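One way to base the effect size on the literature means is Cohen's f, the standard deviation of the group means divided by the common within-group standard deviation. The sketch below (Python, standard library only) computes it for equal group sizes. The question does not report a within-group SD, so the value 10 used here is a purely hypothetical placeholder, and `cohens_f` is an illustrative helper name, not a function from any package.

```python
import math

def cohens_f(group_means, within_sd):
    """Cohen's f for equally sized groups: SD of the group means / within-group SD."""
    k = len(group_means)
    grand_mean = sum(group_means) / k
    # Population-style SD of the means (divide by k, not k - 1), as in Cohen's definition.
    sd_of_means = math.sqrt(sum((m - grand_mean) ** 2 for m in group_means) / k)
    return sd_of_means / within_sd

if __name__ == "__main__":
    means = [30, 39, 35]      # group means from the question
    assumed_sd = 10           # ASSUMPTION: hypothetical within-group SD, not given in the question
    print(round(cohens_f(means, assumed_sd), 3))  # about 0.368 with the placeholder SD
```

The resulting value could then be passed as `f` to `pwr.anova.test` in place of .25. The required sample size depends strongly on the within-group SD, so that number should come from the same literature as the means.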

Related

How do I calculate age mean & standard deviation using aggregate age?

My data set has an age range variable, but I would like to calculate the mean and standard deviation of age.
Since your data is categorical, there isn't a way to calculate the "true" sample mean and standard deviation of respondent age. There are a few different ways you could estimate, depending on how sophisticated you'd like to get.
The simplest way would be to assign an age to each band (say, the mid-point) and summarize on that. The downside is that you will be underestimating the standard deviation (clumping data together tends to do that). To the extent your categories are not uniformly distributed (and from your image they don't appear to be), your estimate of the mean will also be off.
* set point estimates for each age band .
RECODE age (1=22) (2=30) (3=40) (4=50) (5=60) (6=70) (7=80) .
EXE .
* calculate mean and std dev .
MEANS age /CELLS MEAN STDDEV .
More sophisticated estimation techniques might try to account for skews in data (e.g. your sample seems to skew younger) and convert each age band into its own distribution.
For example, instead of assuming 203 respondents are age 22 (as is done in the code above), you might assume about 25 respondents each are 18, 19, 20, ... 25. Even more realistically, you might assume that the distribution within the band also skews younger (e.g. 50 18-year-olds, 40 19-year-olds, etc.).
Automated approaches to that would be interesting as its own question. :)
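As a rough illustration of the two estimates described above, the Python sketch below compares the "everyone sits at the midpoint" approach with spreading each band's respondents evenly over the ages in the band. The band limits and the second count are hypothetical (only the 203 respondents in the youngest band are mentioned in the answer), and both function names are made up for this example.

```python
import math

def midpoint_stats(bands, counts):
    """Mean and sample SD when every respondent is placed at the band midpoint."""
    n = sum(counts)
    mids = [(lo + hi) / 2 for lo, hi in bands]
    mean = sum(m * c for m, c in zip(mids, counts)) / n
    var = sum(c * (m - mean) ** 2 for m, c in zip(mids, counts)) / (n - 1)
    return mean, math.sqrt(var)

def uniform_stats(bands, counts):
    """Mean and sample SD when each band's respondents are spread evenly
    over the integer ages lo..hi (inclusive) of that band."""
    n = sum(counts)
    weighted_ages = [
        (age, c / (hi - lo + 1))          # share of the band's count per single age
        for (lo, hi), c in zip(bands, counts)
        for age in range(lo, hi + 1)
    ]
    mean = sum(a * w for a, w in weighted_ages) / n
    var = sum(w * (a - mean) ** 2 for a, w in weighted_ages) / (n - 1)
    return mean, math.sqrt(var)

if __name__ == "__main__":
    bands = [(18, 25), (26, 35)]   # ASSUMPTION: hypothetical band limits
    counts = [203, 100]            # 203 is from the answer; 100 is a placeholder
    print(midpoint_stats(bands, counts))
    print(uniform_stats(bands, counts))
```

Under the even-spread assumption the mean is unchanged, but the standard deviation comes out larger, which reflects the point that midpoint coding underestimates the spread.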

Get minimum period of non-equidistant signal

I have non-equidistant timestamps and corresponding values like
sample_timestamp powerdemand_in_kw_avg_sum
0 1.539009e+09 2.164672e+01
1 1.539009e+09 3.483988e+01
2 1.539010e+09 1.319316e+01
3 1.539014e+09 1.818989e-15
4 1.539021e+09 2.061695e+00
[...]
I would like to transform it to an equidistant signal. According to the Nyquist–Shannon sampling theorem, I should choose a sampling period smaller than half the minimum period. How can I get the minimum period (using Python)?
Sorry if there is some technical incorrectness; I am new to telecommunications.
To get the minimum difference between two consecutive timestamps, you can use the .shift method:
(df['sample_timestamp'] - df['sample_timestamp'].shift(1)).min()
I'm not an expert in telecommunications, the rest is up to you.
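For readers without pandas at hand, the same idea can be sketched in plain Python. This version sorts the timestamps first and ignores zero gaps, on the assumption that repeated timestamps (the rounded excerpt in the question shows some identical values) should not count as a period. `min_positive_gap` is a hypothetical helper name and the sample values are invented.

```python
def min_positive_gap(timestamps):
    """Smallest strictly positive difference between consecutive sorted timestamps."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:]) if b > a]
    if not gaps:
        raise ValueError("need at least two distinct timestamps")
    return min(gaps)

if __name__ == "__main__":
    # Invented example values (seconds); note the duplicated timestamp.
    sample = [1539009000.0, 1539009000.0, 1539009900.0, 1539014000.0, 1539021000.0]
    print(min_positive_gap(sample))
```

In pandas, `df['sample_timestamp'].diff().min()` gives the same result as the shift-based expression, including any zero gaps.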

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulations on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on confidence level and confidence interval. Assuming a p value of .5 to maximize the sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that confidence interval is calculated using a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because this seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
$$\bar{x} \pm z_{1-\alpha/2} \, \frac{SD}{\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $n$ is the sample size and $z_{1-\alpha/2}$ is the standard normal quantile for the chosen confidence level.
Solving this formula for n gives your sample-size formula.
If the population SD is not known, then you need to replace the z-value with a t-value.
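For the proportion-based setting in the question (p = .5, known population size), the reversal can be sketched as follows. This assumes the linked page uses the common formula n0 = z² p (1 − p) / c² with the finite population correction n = n0 / (1 + (n0 − 1) / N); solving those two expressions for c gives the margin below. `margin_of_error` is an illustrative name, not something taken from the linked page.

```python
import math
from statistics import NormalDist

def margin_of_error(n, population, confidence=0.95, p=0.5):
    """Half-width of the confidence interval (as a proportion) for a sample of
    size n drawn from a finite population, under the assumed formula."""
    if population < 2 or not 0 < n <= population:
        raise ValueError("need population >= 2 and 0 < n <= population")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided z-value
    fpc = (population - n) / (population - 1)            # finite population correction
    return z * math.sqrt(p * (1 - p) / n * fpc)

if __name__ == "__main__":
    # Values from the question: population 54213, sample size 150, 95% confidence.
    print(round(100 * margin_of_error(150, 54213), 2))   # roughly 8, matching the question
```

With the numbers from the question this returns a margin of roughly 8 percentage points, which is the interval the sample size of 150 was derived from.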

How to compare means of two sets when one set is a subset of another and the sample sizes are not

I have two sets containing citation counts for some publications. One of the sets is a subset of the other. That is, the subset contains some of the exact citation counts appearing in the other set, e.g.
Set1 Set2 (Subset)
50 50
24 24
12 -
5 5
4 4
43 43
2 -
2 -
1 -
1 -
So I want to decide whether the numbers in the subset are good enough to represent Set1. On this matter:
I intended to apply Student's t-test, but I am not sure how to apply it. The sets are dependent, so I cannot use an unpaired t-test, which requires that both sets come from independent populations. On the other hand, a paired t-test does not look suitable either, since the sample sizes must be equal.
In case of an outlier, should I remove it? To me that is not logical, since it is not really an outlier but a publication that is cited quite a lot, so it belongs to the same sample. How should I deal with such cases? If I do not remove it, it makes the variance too big, which affects the statistical tests. Is it a good idea to replace it with the median instead of the mean, since citation distributions generally tend to be highly skewed?
How could I remedy this issue?

Randomly select increasing subset of data to see where mean levels off

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to be able to randomly sample an increasing number of observations from the data to see how many observations are required to hit a desired correct classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to do this for the means.
For example, let's say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations. However, it might be the case that the mean had already settled at, say, 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels off in my data. For example, perhaps starting at 10 observations and repeating this randomly, say, 50 or 100 times, then increasing to 20 observations, and so on.
I understand this is a form of Monte Carlo testing. I have access to SPSS 15, 17 and 18 and Excel. I also have access to Minitab 15 & 16 and AMOS 17 and have downloaded R, but I'm not familiar with these. My experience is with SPSS and Excel. I have tried some syntax in SPSS modified from this: http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but it would still be quite time-consuming on my part to enter the subset number etc.
Hope someone can help.
Thanks for reading.
Andy
The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible when you think about constructing the sample that way).
In pseudo-code, the process might look like:
do n for sample size (a to b)
  loop 100 times
    draw sample size n
    compute (& save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, plus you can examine other references on the SPSS tag wiki). Basically once you figure out how to draw the sample and compute the stats you want, you just need to figure out how to write a macro so you can loop through the process (and pass it the sample size parameter). I include the loop 100 times because you want to be able to make some type of estimate about the error associated with each sample size.
If you give an example of how you compute the statistics I may be able to give examples of how to make that into a macro function and loop through the desired number of times.
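Outside SPSS, the pseudo-code above can be sketched in a few lines of Python (standard library only). For each sample size it draws repeated random samples without replacement and records how much the sample means vary; the sample size at which that spread stops shrinking noticeably is where the mean "levels off". The data here are simulated stand-ins, and `mean_spread_by_size` is a hypothetical helper name.

```python
import random
import statistics

def mean_spread_by_size(values, sizes, reps=100, seed=None):
    """For each sample size n, draw `reps` random samples without replacement
    and return {n: standard deviation of the resulting sample means}."""
    rng = random.Random(seed)
    spread = {}
    for n in sizes:
        means = [statistics.mean(rng.sample(values, n)) for _ in range(reps)]
        spread[n] = statistics.stdev(means)
    return spread

if __name__ == "__main__":
    # ASSUMPTION: simulated data standing in for variable X in group 1
    # (2000 observations with a mean near 141; the SD of 20 is invented).
    data_rng = random.Random(1)
    x_group1 = [data_rng.gauss(141, 20) for _ in range(2000)]
    result = mean_spread_by_size(x_group1, sizes=range(10, 1001, 90), seed=2)
    for n, sd in result.items():
        print(n, round(sd, 2))
```

The inner list of means plays the role of "compute (& save) statistics"; other statistics could be collected in the same loop.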
@Andy W
@Oliver
Thanks for your suggestions, guys. I've managed to find a workaround using the macro from http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt (reproduced below). However, for this I need to copy and paste the variable data for a given group into a new data window. That's not too much of a problem. To take this further, would anyone know how I could: 1/ get other statistics recorded, e.g. std error, std dev, etc.; 2/ use other analyses, ideally discriminant function analysis, and record the percentage of correct classifications in a new data window rather than having lots of output tables; 3/ avoid copying and pasting variables for each group, so I can just run the macro specifying n samples for variable x on groups 1, 2, 3 & 4.
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary)
* nbsampl = number of samples.
* size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=!myvar BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call macro (parameters are number of samples (here 20) and sizes of sample (here 5, 10, 15, 30, 50)).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.
