Randomly select increasing subset of data to see where mean levels off - subset

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to able to randomly sample an increasing number of a subset of the data to see how many observations are required to hit a desired correct classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to to do this for the means.
For example, Lets say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations. However, it might be the case that the mean occurred at say 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels of in my data. For example, perhaps starting at 10 observations and repeating this randomly say 50 or 100 times, then increasing to 20 observations....and so on.
I understand this is a form of monte carlo testing. I have access to SPSS 15, 17 and 18 and excel. I also have access to minitab 15 & 16 and amos17 and have downloaded "R" but im not familiar with these. My experience is with SPSS and excel. I have tried some syntax in SPSS Modified from this..http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but this would still be quite time consuming on my part to enter the subset number ect etc.
Hope some one can help.
Thanks for reading.
Andy

The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible when you think about constructing the sample that way).
In pseudo-code, the process might look like;
do n for sample size (a to b)
loop 100 times
draw sample size n
compute (& save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, plus you can examine other references on the SPSS tag wiki). Basically once you figure out how to draw the sample and compute the stats you want, you just need to figure out how to write a macro so you can loop through the process (and pass it the sample size parameter). I include the loop 100 times because you want to be able to make some type of estimate about the error associated with each sample size.
If you give an example of how you compute the statistics I may be able to give examples of how to make that into a macro function and loop through the desired number of times.

#Andy W
#Oliver
Thanks for your suggestions guys. Ive managed to find a work around using the following macro from.........http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt However, for this I need to copy and paste the variable data for a given group into a new data window. Thats not to much of a problem. To take this further would anyone know how: 1/ I could get other statistics recorded eg std error, std dev ect ect. 2/Use other analysis, ideally discriminant function analysis and record in a new data window the percentage of correct classificcations rather than having lots of output tables 3/not need to copy and paste variables for each group so I can just run the macro specifying n samples for x variable on group 1, 2, 3 & 4.
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary)
* nbsampl = number of samples.
* size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=salary BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call macro (parameters are number of samples (here 20) and sizes of sample (here 5, 10,15,30,50).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.

Related

How do I calculate confidence interval with only sample size and confidence level

I'm writing a program that lets users run simulates on a subset of data, and as part of this process, the program allows a user to specify what sample size they want based on confidence level and confidence interval. Assuming a p value of .5 to maximum sample size, and given that I know the population size, I can calculate the sample size. For example, if I have:
Population = 54213
Confidence Level = .95
Confidence Interval = 8
I get Sample Size 150. I use the formula outlined here:
https://www.surveysystem.com/sample-size-formula.htm
What I have been asked to do is reverse the process, so that confidence interval is calculated using a given sample size and confidence level (and I know the population). I'm having a horrible time trying to reverse this equation and was wondering if there is a formula. More importantly, does this seem like an intelligent thing to do? Because this seems like a weird request to me.
I should mention (just to be clear) that the CI is estimated for the mean, not the population. In that case, if we assume the population is normally distributed and that we know the population standard deviation SD, then the CI is estimated as
From this formula you would also get your formula, where you are estimating n.
If the population SD is not known then you need to replace the z-value with a t-value.

Descriptive statistics, percentiles

I am stuck in a statistics assignment, and would really appreciate some qualified help.
We have been given a data set and are then asked to find the 10% with the lowest rate of profit, in order to decide what Profit rate is the maximum in order to be considered for a program.
the data has:
Mean = 3,61
St. dev. = 8,38
I am thinking that i need to find the 10th percentile, and if i run the percentile function in excel it returns -4,71.
However I tried to run the numbers by hand using the z-score.
where z = -1,28
z=(x-μ)/σ
Solving for x
x= μ + z σ
x=3,61+(-1,28*8,38)=-7,116
My question is which of the two methods is the right one? if any at all.
I am thoroughly confused at this point, hope someone has the time to help.
Thank you
This is the assignment btw:
"The Danish government introduces a program for economic growth and will
help the 10 percent of the rms with the lowest rate of prot. What rate
of prot is the maximum in order to be considered for the program given
the mean and standard deviation found above and assuming that the data
is normally distributed?"
The excel formula is giving the actual, empirical 10th percentile value of your sample
If the data you have includes all possible instances of whatever you’re trying to measure, then go ahead and use that.
If you’re sampling from a population and your sample size is small, use a t distribution or increase your sample size. If your sample size is healthy and your data are normally distributed, use z scores.
Short story is the different outcomes suggest the data you’ve supplied are not normally distributed.

Excel - How to split data into train and test sets that are equally distributed

I've got a data set (in Excel) that I'm going to import into SAS to undertake some modelling.
I've got a method for randomly splitting my excel dataset (using the =RAND() function), but is there a way (at the splitting stage) to ensure the distribution of the samples is even (other than to keep randomly splitting and testing the distribution until it becomes acceptable)?
Otherwise, if this is best performed in SAS, what is the most efficient approach for testing the sample randomness?
The dataset contains 35 variables, with a mixture of binary, continuous and categorical variables.
In SAS, you can just use proc surveyselect to do this.
proc surveyselect data=sashelp.cars out=cars_out outall samprate=0.7;
run;
data train test;
set cars_out;
if selected then output test;
else output train;
run;
If there is a particular variable[s] you want to make sure the Train and Test sets are balanced on, you can use either strata or control depending on exactly what sort of thing you're talking about. control will simply make an approximate attempt to even things by the control variables (it sorts by the control variable, then pulls every 3rd or whatever, so you get a sort of approximate balance; if you have 2+ control variables it snake-sorts, Asc. then Desc. etc. inside, but that reduces randomness).
If you use strata, it guarantees you the sample rate inside the strata - so if you did:
proc sort data=sashelp.cars out=cars;
by origin;
run;
proc surveyselect data=cars out=cars_out outall samprate=0.7;
strata origin;
run;
(and the final splitting data step is the same) then you'd get 70% of each separate origin pulled (which would end up being 70% of the total, of course).
Which you do depends on what you care about it being balanced by. The more things you do this with, the less balanced it is with everything else, so be cautious; it may be that a simple random sample is the best, especially if you have a good enough N.
If you don't have enough N, then you can use bootstrapping techniques, meaning you take a sample WITH replacement from that 70% and take maybe 100 of those samples, each with a higher N than your original. Then you do your test or whatever on each sample selected, and the variation in those results tells you how you're doing even if your N is not enough to do it in one pass.
This answer has nothing to do with Excel, but with sampling strategy.
First we must construct a criteria that the sample's measure's are "close enough" to the complete dataset.
Say we are interested in the mean and the standard deviation and that the complete population is a set of 10,000 values in column A
we calculate the mean and standard deviation of the complete dataset.
devise a "close enough" criteria for each measure
pick, say, 500 samples
calculate the measures for the sample.
if the measures are "close enough" we are done, otherwise pick another 500.
We need to be careful that the criteria are not too tight; otherwise we may loop forever.

Working Out Big O Of Functions Using Java + Excel

I have been trying to get the big O of four different functions using java and excel. I have no idea what these functions are as they have been hidden. I am not sure if this is the right place / forum to ask.
I have got the functions to give various pieces of data using some java and put them into excel along with the steps (1-n). I then put them into graphs straight away using just the n and the arbitrary measure of time they took if the output was constantly the same. For example if n = 1 always equal 200 for every time its run. For the ones that varied each time the function was run I ran the function 10 times and did an average for each step.
After I had the data I created a graph for each one and put a trendline on it. My f(1) for example was best fitted to a polynomial trendline order 2, which I assume is Quadratic (n2) of big O?. But I needed to prove it was n2, so I did =Steps/LOG(N) which made it fit best to a polynomial trendline order 3, which I assume is Cubic (n3)? (Is that right?)
I really have no idea what to do next to 'prove' that this function is Quadratic or Cubic or how to prove its best case / worst case.
So basically I am trying to work out what the missing step is.
Computation
Graph
Trendline
??? - Proof that the function has big O(?)
When you say "if n=1 always equal 200" does that mean if n=1 it takes 200 steps to run? If that's the case this function would be 200n and this O(n).
I think to solve this you should call each function on different values (I'd start with 10, 20, 30 ... ect) up to some high number. Capture these values and plot them in Excel. Then use the built in trend line function. This should give you a rough estimate of what the run time is. From there you should be able to get the Big-O.

Looking for an estimation method (data analysis)

Since I have no idea about what I am doing right now, my wording may sound funny. But seriously, I need to learn.
The problem I'm facing is to come up with a method (model) to estimate how a software program works: namely running time and maximal memory usage. What I already have are a large amount of data. This data set gives an overview of how a program works under different conditions, e.g.
<code>
RUN Criterion_A Criterion_B Criterion_C Criterion_D Criterion_E <br>
------------------------------------------------------------------------
R0001 12 2 3556 27 9 <br>
R0002 2 5 2154 22 8 <br>
R0003 19 12 5556 37 9 <br>
R0004 10 3 1556 7 9 <br>
R0005 5 1 556 17 8 <br>
</code>
I have thousands of rows of such data. Now I need to know how I can estimate (forecast) the running time and maximal memory usage if I know all criteria in advance. What I need is an approximation that gives hints (upper limits, or ranges).
I have feeling that it is a typical ??? problem which I don't know. Could you guys show me some hints or give me some ideas (theories, explanations, webpages) or anything that may help. Thanks!
You want a new program that takes as input one or more criteria, then outputs an estimate of the running time or memory usage. This is a machine learning problem.
Your inputs can be listed as a vector of numbers, like this:
input = [ A, B, C, D, E ]
One of the simplest algorithms for this would be a K-nearest neighbor algorithm. The idea behind this is that you'll take your input vector of numbers, and find in your database the vector of numbers that is most similar to your input vector. For example, given this vector of inputs:
input = [ 11, 1.8, 3557, 29, 10 ]
You can assume that the running time and memory should be very similar to the values from this run (originally in your table listed above):
R0001 12 2 3556 27 9
There are several algorithms for calculating the similarity between these two vectors, one simple and intuitive such algorithm is the Euclidean distance. As an example, the Euclidean distance between the input vector and the vector from the table is this:
dist = sqrt( (11-12)^2 + (1.8-2)^2 + (3557-3556)^2 + (27-29)^2 + (9-10)^2 )
dist = 2.6533
It should be intuitively clear that points with lower distance should be better estimates for running time and memory usage, as the distance should describe the similarity between two sets of criteria. Assuming your criteria are informative and well-selected, points with similar criteria should have similar running time and memory usage.
Here's some example code of how to do this in R:
r1 = c(11,1.8,3557,29,10)
r2 = c(12,2.0,3556,27, 9)
print(r1)
print(r2)
dist_r1_r2 = sqrt( (11-12)^2 + (1.8-2)^2 + (3557-3556)^2 + (27-29)^2 + (9-10)^2 )
print(dist_r1_r2)
smarter_dist_r1_r2 = sqrt( sum( (r1 - r2)^2 ) )
print(smarter_dist_r1_r2)
Taking the running time and memory usage of your nearest row is the KNN algorithm for K=1. This approach can be extended to include data from multiple rows by taking a weighted combination of multiple rows from the database, with rows with lower distances to your input vector contributing more to the estimates. Read the Wikipedia page on KNN for more information, especially with regard to data normalization, including contributions from multiple points, and computing distances.
When calculating the difference between these lists of input vectors, you should consider normalizing your data. The rationale for doing this is that a difference of 1 unit between 3557 and 3556 for criteria C may not be equivalent to a difference of 1 between 11 and 12 for criteria A. If your data are normally distributed, you can convert them all to standard scores (or Z-scores) using this formula:
N_trans = (N - mean(N)) / sdev(N)
There is no general solution on the "right" way to normalize data as it depends on the type and range of data that you have, but Z-scores are easy to compute and a good method to try first.
There are many more sophisticated techniques for constructing estimates such as this, including linear regression, support vector regression, and non-linear modeling. The idea behind some of the more sophisticated methods is that you try and develop an equation that describes the relationship of your variables to running time or memory. For example, a simple application might just have one criterion and you can try and distinguish between models such as:
running_time = s1 * A + s0
running_time = s2 * A^2 + s1 * A + s0
running_time = s3 * log(A) + s2 * A^2 + s1 * A + s0
The idea is that A is your fixed criteria, and sN are a list of free parameters that you can tweak until you get a model that works well.
One problem with this approach is that there are many different possible models that have different numbers of parameters. Distinguishing between models that have different numbers of parameters is a difficult problem in statistics, and I don't recommend tackling it during your first foray into machine learning.
Some questions that you should ask yourself are:
Do all of my criteria affect both running time and memory usage? Do some affect only one or the other, and are some useless from a predictive point of view? Answering this question is called feature selection, and is an outstanding problem in machine learning.
Do you have any a priori estimates of how your variables should influence running time or memory usage? For example, you might know that your application uses a sorting algorithm that is N * log(N) in time, which means that you explicitly know the relationship between one criterion and your running time.
Do your rows of measured input criteria paired with running time and memory usage cover all of the plausible use cases for your application? If so, then your estimates will be much better, as machine learning can have a difficult time with data that it's unfamiliar with.
Do the running time and memory of your program depend on criteria that you don't input into your estimation strategy? For example, if you're depending on an external resource such as a web spider, problems with your network may influence running time and memory usage in ways that are difficult to predict. If this is the case, your estimates will have a lot more variance.
If the criterion you would be forecasting for lies within the range of currently known criteria then you should do some more research on the Interpolation process:
In the mathematical subfield of numerical analysis, interpolation is a method of constructing new data points within the range of a discrete set of known data points
If it lies outside your currently known data range research Extrapolation which is less accurate:
In mathematics, extrapolation is the process of constructing new data points outside a discrete set of known data points.
Methods
Interpolation methods for your browsing.
A powerpoint presentation detailing some methods used for Extrapolation.

Resources