I am trying to compute the critical values for the two-sided Kolmogorov-Smirnov test (PROC NPAR1WAY doesn't output these!). The critical value is calculated as c(a) * sqrt( (n+m)/(nm) ), where n and m are the number of observations in each dataset and c(a) = 1.36 for significance level a = 0.05.
Either,
A) is there a routine in SAS that will calculate these for me? (I've been searching a while) or,
B) what is the best way to compute the statistic myself? My initial approach is to select the number of rows from each data set into macro variables and then compute the statistic, but this feels ugly.
Thanks in advance
A) Probably not, if you've searched all the relevant documentation.
B) That method sounds fine, but you can use a data step if you prefer, e.g.
data example1 example2;
set sashelp.class;
if _n_ < 6 then output example1;
else output example2;
run;
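data _null_;
/* nobs= is populated at compile time, so no rows are actually read */
if 0 then set example1 nobs = n;
if 0 then set example2 nobs = m;
/* symputx avoids the numeric-to-character conversion note and trims blanks */
call symputx('Kolmogorov_Smirnov_05',1.36 * sqrt((n+m)/(n*m)));
/* stop explicitly so the step does not end with a looping note */
stop;
run;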
%put &=Kolmogorov_Smirnov_05;
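If you need a significance level other than 0.05, the constant can be computed rather than hard-coded. A minimal sketch, assuming the usual large-sample approximation c(a) = sqrt(-ln(a/2)/2), which gives roughly 1.36 at a = 0.05, and reusing example1 and example2 from above. The macro variable names alpha and ks_crit are placeholders of mine:

```sas
%let alpha = 0.05;  /* significance level; placeholder name */

data _null_;
  /* nobs= is filled in at compile time, so no rows are actually read */
  if 0 then set example1 nobs = n;
  if 0 then set example2 nobs = m;
  c_alpha = sqrt(-0.5 * log(&alpha / 2));   /* log() is the natural log */
  crit    = c_alpha * sqrt((n + m) / (n * m));
  call symputx('ks_crit', crit);            /* store as a trimmed macro value */
  stop;                                     /* end the step after one pass */
run;

%put &=ks_crit;
```

The approximation is only intended for reasonably large n and m, so treat the result with care for very small samples.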
I used a proc reg to come up with my prediction intervals:
PROC REG data = steers;
model y = x / cli;
run;
Here's my output for the proc reg
I need to construct a 95% interval for y = 300. How would I do that?
1. Save the model to a data set to pass to PROC SCORE; call it modelParams.
2. Create a data set with the data points required for prediction, also to pass to PROC SCORE.
3. Run PROC SCORE.
Other options include:
Using a CODE statement to generate data step code to process a data set from Step 2
Adding a fake data point to your original data, with x = 300 but no y value, so it gets a prediction without affecting the fit
PROC PLM instead of SCORE, same functionality mostly
Manually save estimates to a data set and manually calculate the estimates
Use a SCORE statement within a PROC, if available.
Most are outlined here with examples.
Step 1
PROC REG data = steers outest=modelParams;
model y = x / cli;
run;
Step 2
data newValues;
input x;
cards;
300
;;;;
Step 3
proc score data=newValues score=modelParams type=parms predict out=Pred;
var x;
run;
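As far as I know, PROC SCORE produces only the point prediction, so for the interval itself the "fake data point" option listed above is often the shortest route. A rough sketch, assuming x = 300 is the predictor value of interest; the data set names extra, steers_plus and pred_int and the output variable names are made up for illustration:

```sas
/* One extra row: x is known, y is missing, so it is not used in the fit */
data extra;
  x = 300;
  y = .;
run;

data steers_plus;
  set steers extra;
run;

proc reg data=steers_plus;
  model y = x / cli;   /* CLI prints 95% prediction limits for each row */
  /* LCL=/UCL= are limits for an individual prediction,
     LCLM=/UCLM= are limits for the mean response */
  output out=pred_int p=predicted lcl=lower_pred ucl=upper_pred
                      lclm=lower_mean uclm=upper_mean;
run;
quit;

proc print data=pred_int;
  where y = .;   /* show only rows with a missing response, i.e. the appended one */
run;
```

Which pair of limits you report depends on whether you want an interval for a single new observation or for the mean response at x = 300.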
I'm studying for my SAS test and I found the following question:
Given a dataset with 5000 observations, and given the following two data steps (each creating a subset of it).
Data test1 (keep= cust_id trans_type gender);
Set transaction;
Where trans_type = 'Travel';
Run;
Data test2;
Set transaction (keep= cust_id trans_type gender);
Where trans_type = 'Travel';
Run;
I. Which one of the above two datasets (test1, test2) will take less time to run?
I think that both take the same time to run because basically both carry out the same instructions in a different order. Am I right, or does the order of instructions affect the runtime?
The answer the book is looking for is that test2 will be faster. That is because there is a difference between the two:
test1: Read in all variables, only write out 3 of them
test2: Read in 3 variables, write out all variables read
SAS has some advantages based on the physical dataset structure that allows it to more efficiently read in subsets of the dataset, particularly if those variables are stored consecutively.
However, in real-world scenarios this may or may not be true, and in a 5000-row dataset in particular, you probably won't see any difference between the two. For example:
data class1M;
set sashelp.class;
do _i = 1 to 1e6;
output;
end;
run;
data test1(keep=name sex);
set class1M;
run;
data test2;
set class1M(keep=name sex);
run;
Both of these data steps take the identical length of time. That's likely because the dataset is being read into memory and then bits are being grabbed as needed - a 250MB dataset just isn't big enough to trigger any efficiencies that way.
However, if you add a bunch of other variables:
data class1M;
set sashelp.class;
length a b c d e f g h i j k l 8;
do _i = 1 to 1e6;
output;
end;
run;
data test1(keep=name sex);
set class1M;
run;
data test2;
set class1M(keep=name sex);
run;
Now it takes a lot longer to run test1 than test2. That's because the dataset for test1 now doesn't fit into memory, so it's reading it by bits, while the dataset for test2 does fit in memory. Make it a lot bigger row-wise, say 10M rows, and it will take a long time for both test1 and test2 - but a bit shorter for test2.
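To check this on your own system, one option is to turn on FULLSTIMER and compare the real time, CPU time and memory figures that SAS writes to the log for each step. A sketch reusing the class1M data set from above; the numbers will vary with hardware and system settings:

```sas
options fullstimer;            /* log detailed timing and memory for every step */

data test1(keep=name sex);     /* reads every variable, writes two */
  set class1M;
run;

data test2;                    /* reads only two variables */
  set class1M(keep=name sex);
run;

options nofullstimer;          /* back to the shorter default notes */
```

Running each step more than once helps separate a real difference from caching effects.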
test2 will take less time to run, as it brings in fewer variables.
I have a small dataset consisting of three distinct observations on each of three variables, say x1, x2, x3, and the accompanying response y, on which I would like to perform an Analysis of Variance to test whether the means are equal.
data anova;
input var obs resp;
cards;
1 1 1.1
1 2 .5
1 3 -2.1
2 1 4.2
2 2 3.7
2 3 .8
3 1 3.2
3 2 2.8
3 3 6.3
;
proc anova data=anova;
class var;
model resp=var;
run;
All good so far. Now however, I would like to use a permutation test to check the p-value of the F-statistic. The test is quite simple and it consists of randomly reassigning the 9 observations to the 3 variables so that each variable has three observations, and calculating the F-statistic each time. The p-value would then be the proportion of these statistics that are higher than 4.39, the value of the F-test from the code above.
Normally I would do it by hand, but there are 1680 possible combinations (9!/(3!3!3!)) here, so it would take me a week. Is there a more elegant way to accomplish this? Perhaps a way to wrap this procedure in a loop or a function?
I would be grateful for your help, thank you.
Resampling like this can be done with a few steps. First, following an example from SAS help, use PROC SQL to create a table with all the possible combinations of var and resp.
PROC SQL;
CREATE TABLE possibilities AS SELECT a.var, b.resp FROM
anova a CROSS JOIN anova b;
QUIT;
Then, resample from all the possible combinations of var and resp using PROC SURVEYSELECT.
PROC SURVEYSELECT
DATA = possibilities
OUT = permutations
METHOD = URS /* URS means unrestricted sampling, or sampling uniformly with
replacement */
SAMPSIZE = 9 /* Generate samples that are the same size as your original
data */
REP = 1000; /* Repeat 1000 times */
STRATA var / ALLOC = PROP; /* Make sure that we sample evenly from each of the
3 values of var */
RUN;
Then, using PROC GLM with a BY statement, calculate the F statistic for each replicate.
/* Data must be sorted to use a BY statement */
PROC SORT
DATA = permutations;
BY replicate;
RUN;
PROC GLM
DATA = permutations
NOPRINT
OUTSTAT = f_statistics;
CLASS var;
MODEL resp = var;
BY replicate;
WEIGHT numberhits;
QUIT;
Finally, in a data step, create a dummy variable indicating whether each F statistic is greater than 4.39, the test statistic from the example, and then use the MEANS procedure to get the fraction of times that was true. The final answer should be approximately 0.068, as in the original example.
DATA final;
SET f_statistics;
IF f ne .; /* Drop rows of the GLM output that don't contain an F statistic */
reject_null = (f > 4.39);
RUN;
PROC MEANS
DATA = final
MEAN;
VAR reject_null;
RUN;
This is my approach. It uses the allperm routine of SAS to generate all permutations. Then, in order to eliminate duplicates, I use the product of the numbers in each group as the key. The result is 1680 distinct arrangements.
After this, you can use the BY statement of PROC GLM to fit the model separately for each value of the group indicator, group, in the final data set.
proc transpose data=anova(keep=resp) out=anova1;
run;
quit;
data anova1;
set anova1;
n = fact(9);
array rs (*) col1-col9;
do i=1 to n;
call allperm(i, of rs(*));
a = rs(1)*rs(2)*rs(3);
b = rs(4)*rs(5)*rs(6);
c = rs(7)*rs(8)*rs(9);
file resperms notitles;
put rs(*) a b c;
end;
run;
data perms;
infile resperms;
input x1-x3 y1-y3 z1-z3 a b c;
run;
proc sort data=perms nodupkey;
by a b c;
run;
data perms;
set perms;
group = _N_;
drop a b c;
run;
proc transpose data=perms out=perms;
by group;
run;
quit;
data perms;
set perms;
var = substr(_NAME_,1,1);
obs = substr(_NAME_,2,1)*1;
rename col1=resp;
drop _NAME_;
run;
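The PROC GLM step mentioned above is not shown, so here is a sketch of how it might look. It assumes the final perms data set has the variables group, var and resp, with one group per distinct assignment, and that 4.39 is the observed F from the original data; perm_f and extreme are names I made up:

```sas
proc glm data=perms noprint outstat=perm_f;
  class var;
  model resp = var;
  by group;                  /* perms is already in group order */
run;
quit;

data perm_f;
  set perm_f;
  where _type_ = 'SS3';      /* keep one F row per group; drops the ERROR rows */
  extreme = (f >= 4.39);     /* 1 if at least as large as the observed F */
run;

proc means data=perm_f n mean;
  var extreme;               /* the mean is the permutation p-value */
run;
```

Because 4.39 is a rounded value, arrangements that tie with the observed one may be misclassified; comparing against the unrounded F from the original fit would be safer.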
I need to calculate, using PROC MEANS for each visit, the p-values associated with these t-values, testing whether the means are equal to 0. Also, I have only one classification level (i.e., Treatment group = 1). Alpha for this test is set to 0.05.
The following code is what I am using, but I am not getting the desired p-value.
PROC MEANS DATA=<input dataset> ALPHA=0.05;
BY _VISNAME;
VAR Total_Score;
OUTPUT OUT=D1<output dataset>
N=
MEAN=
MEDIAN=
STD=
STDERR=
UCLM=
LCLM=
PROBT= / AUTONAME;
RUN;
Please suggest where I am going wrong.
I cannot see your data set, so I can only guess at the problem. For starters, I would put your descriptive statistic keywords on the PROC MEANS statement itself, next to data=. This should at least generate some form of result:
proc means data=work.sample N MEAN MEDIAN STD STDERR UCLM LCLM PROBT T;
BY _VISNAME;
VAR Total_Score;
output out=work.means;
run;
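If the p-values are needed in a data set rather than in the printed output, something like the following may be closer to what you want. A sketch assuming the data are in work.sample as above; work.sorted and work.stats are placeholder names. PROBT is the two-tailed p-value for the null hypothesis that the mean is 0, and BY processing expects the input to be sorted by _VISNAME:

```sas
/* BY-group processing needs the data sorted by the BY variable */
proc sort data=work.sample out=work.sorted;
  by _visname;
run;

proc means data=work.sorted noprint alpha=0.05;
  by _visname;
  var total_score;
  /* AUTONAME names each output variable after the analysis variable
     and the statistic */
  output out=work.stats
         n= mean= stderr= lclm= uclm= t= probt= / autoname;
run;

proc print data=work.stats;
run;
```

Requesting T= alongside PROBT= makes it easy to check each p-value against its t-value.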
This question is pretty simple (I hope). I am working my way through some introductory SAS material and cannot find the proper way of running a two-sample test of proportions.
proc freq data;
tables / binomial (p=...);
run;
requires a known proportion (i.e. testing against a known value). I'd like to compare two samples of categorical variables, with null hypothesis p1 = p2 against the alternative p1 < p2.
Data resembles:
V1 Yes
V1 No
V2 Yes
V2 No
For many lines. I need to compare the proportion of Yes's and No's between the two populations (V1 and V2). Can someone point me towards the correct procedure? Google search has left me spinning.
Thanks.
Two binary (1/0) variables seem like a case for a chi-squared test.
data for_test;
do _t = 1 to 20;
x1 = ifn(ranuni(7)<0.5,1,0);
x2 = ifn(ranuni(7)<0.5,1,0);
output;
end;
run;
proc freq data=for_test;
tables x1*x2/chisq;
run;
This is the chi-square test for equality of proportions. The use case of p1 < p2 is still pending. It is possible to do the test for p1 < p2 using chi