I'm studying for my SAS test and I found the following question:
Given a dataset with 5000 observations, consider the following two data steps (each creating a subset of the first).
Data test1 (keep= cust_id trans_type gender);
  Set transaction;
  Where trans_type = 'Travel';
Run;
Data test2;
  Set transaction (keep= cust_id trans_type gender);
  Where trans_type = 'Travel';
Run;
I. Which one of the above two data steps (test1, test2) will take less time to run?
I think that both take the same time to run because both essentially execute the same instructions in a different order. Am I right, or does the order of instructions affect the runtime?
The answer the book is looking for is that test2 will be faster. That is because there is a difference between the two:
test1: reads in all variables, writes out only 3 of them
test2: reads in only 3 variables, writes out all variables read
SAS has some advantages based on the physical dataset structure that allow it to read in a subset of a dataset's variables more efficiently, particularly if those variables are stored consecutively.
However, in real-world scenarios this may or may not be true, and in particular with a 5000-row dataset you probably won't see any difference between the two. For example:
data class1M;
  set sashelp.class;
  do _i = 1 to 1e6;
    output;
  end;
run;
data test1(keep=name sex);
  set class1M;
run;
data test2;
  set class1M(keep=name sex);
run;
Both of these data steps take essentially the same length of time. That's likely because the dataset is read into memory and the needed bits are grabbed from there - a 250MB dataset just isn't big enough to trigger any efficiencies that way.
However, if you add a bunch of other variables:
data class1M;
  set sashelp.class;
  length a b c d e f g h i j k l 8;
  do _i = 1 to 1e6;
    output;
  end;
run;
data test1(keep=name sex);
  set class1M;
run;
data test2;
  set class1M(keep=name sex);
run;
Now test1 takes a lot longer to run than test2. That's because the dataset test1 reads no longer fits into memory, so it is read piecemeal from disk, while the three variables test2 reads still fit in memory. Make it a lot bigger row-wise, say 10M rows, and both test1 and test2 will take a long time - but test2 will still be a bit shorter.
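If you want to check the timing on your own machine, the FULLSTIMER system option makes the log report detailed real time and memory figures for every step (a minimal sketch, re-running the two test steps from above):

```sas
options fullstimer;

data test1(keep=name sex);
  set class1M;
run;

data test2;
  set class1M(keep=name sex);
run;

options nofullstimer;
```

Compare the "real time" and "memory" lines the log prints for each of the two steps.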
test2 will take less time to run, as it brings in fewer variables.
I used a proc reg to come up with my prediction intervals:
PROC REG data = steers;
model y = x / cli;
run;
Here's my output for the PROC REG.
I need to construct a 95% prediction interval for y at x = 300. How would I do that?
Save the model to a data set to pass to PROC SCORE; call it modelParams.
Create a data set to pass to PROC SCORE with the data points required for prediction.
Run PROC SCORE.
Other options include:
Using a CODE statement to generate data step code that can process the data set from Step 2
Adding a fake data point with x = 300 but no y value to your original data, so it gets a prediction without affecting the fit
Using PROC PLM instead of PROC SCORE - mostly the same functionality
Manually saving the estimates to a data set and calculating the predictions yourself
Using a SCORE statement within a PROC, if available.
Most are outlined here with examples.
Step 1
PROC REG data = steers outest=modelParams;
model y = x / cli;
run;
Step 2
data newValues;
input x;
cards;
300
;;;;
Step 3
proc score data=newValues score=modelParams type=parms predict out=Pred;
  var x;
run;
I have a problem in SAS. I have a dataset where the numbers are stored as strings. The problem is that the size and format of the numbers vary a lot. I will give an example:
data s;
  x = '123456789012345';
  y = input(x, best32.);
  z = '0.0001246564';
  a = input(z, best32.);
  put y=;
  put a=;
  keep y a;
run;
Output:
y=1.2345679E14
a=0.0001246564
As you can see, I lose information in the large integer. How can I write my program so that no information is lost? As far as I understand, this number has no more than 15 digits, so SAS should be able to store it exactly. I really miss Python, where I could just set y = float(x).
No loss of information has occurred. You're just mistaking an informat for a format.
The informat best32. told SAS to read the string with a width of 32. All good: you have all 15 digits stored in the number.
Now, if you want to see all 15 digits, you need to use a format wider than the default best12. format on output:
data s;
  x = '123456789012345';
  y = input(x, best32.);
  z = '0.0001246564';
  a = input(z, best32.);
  put y= best32.;
  put a= best32.;
  keep y a;
run;
But even if you don't display it this way, the number y is still exactly equal to 123456789012345 if you do math with it or similar - you haven't lost any information, you just weren't displaying all of it.
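A quick way to convince yourself of that (a minimal sketch; it relies on the fact that an 8-byte SAS numeric is a double and represents every integer up to 2^53, about 9.0e15, exactly):

```sas
data _null_;
  y = input('123456789012345', best32.);
  /* if no precision was lost, subtracting the integer literal gives exactly 0 */
  diff = y - 123456789012345;
  put diff=;
run;
```

The log shows diff=0, confirming the stored value is exact.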
I am trying to compute the critical values for the two-sided Kolmogorov-Smirnov test (PROC NPAR1WAY doesn't output these!). These are calculated as c(a) * sqrt( (n+m)/(n*m) ), where n and m are the numbers of observations in each dataset and c(a) = 1.36 for significance level a = 0.05.
Either,
A) is there a routine in SAS that will calculate these for me? (I've been searching a while) or,
B) What is the best way to compute the statistic myself? My initial approach is to select the number of rows from each data set into macro variables and then compute the statistic, but this feels ugly.
Thanks in advance
A) Probably not, if you've searched all the relevant documentation.
B) That method sounds fine, but you can use a data step if you prefer, e.g.
data example1 example2;
  set sashelp.class;
  if _n_ < 6 then output example1;
  else output example2;
run;
data _null_;
  /* if 0 then set ... nobs= grabs the row counts without reading any data */
  if 0 then set example1 nobs = n;
  if 0 then set example2 nobs = m;
  /* symputx avoids the leading blanks that symput would leave in the macro variable */
  call symputx('Kolmogorov_Smirnov_05', 1.36 * sqrt((n+m)/(n*m)));
run;
%put &=Kolmogorov_Smirnov_05;
I have a small dataset consisting of three distinct observations on each of three variables, say x1, x2, x3, and the accompanying response y, on which I would like to perform an analysis of variance to test whether the means are equal.
data anova;
input var obs resp;
cards;
1 1 1.1
1 2 .5
1 3 -2.1
2 1 4.2
2 2 3.7
2 3 .8
3 1 3.2
3 2 2.8
3 3 6.3
;
proc anova data=anova;
class var;
model resp=var;
run;
All good so far. Now, however, I would like to use a permutation test to check the p-value of the F-statistic. The test is quite simple: randomly reassign the 9 observations to the 3 variables so that each variable has three observations, and calculate the F-statistic each time. The p-value is then the proportion of these statistics that are higher than 4.39, the value of the F-test from the code above.
Normally I would do it by hand, but there are 1680 possible combinations ( 9!/(3!3!3!) ) here, so it would take me a week. Is there a more elegant way to accomplish this? Perhaps a way to wrap this procedure in a loop or a function?
I would be grateful for your help, thank you.
Resampling like this can be done with a few steps. First, following an example from SAS help, use PROC SQL to create a table with all the possible combinations of var and resp.
PROC SQL;
CREATE TABLE possibilities AS SELECT a.var, b.resp FROM
anova a CROSS JOIN anova b;
QUIT;
Then, resample from all the possible combinations of var and resp using PROC SURVEYSELECT.
PROC SURVEYSELECT
DATA = possibilities
OUT = permutations
METHOD = URS /* URS means unrestricted sampling, or sampling uniformly with
replacement */
SAMPSIZE = 9 /* Generate samples that are the same size as your original
data */
REP = 1000; /* Repeat 1000 times */
STRATA var / ALLOC = PROP; /* Make sure that we sample evenly from each of the
3 values of var */
RUN;
Then, using PROC GLM with a BY statement, calculate the F statistic for each replicate.
/* Data must be sorted to use a BY statement */
PROC SORT
DATA = permutations;
BY replicate;
RUN;
PROC GLM
DATA = permutations
NOPRINT
OUTSTAT = f_statistics;
CLASS var;
MODEL resp = var;
BY replicate;
WEIGHT numberhits;
QUIT;
Finally, in a data step, create a dummy variable indicating whether each F statistic is greater than 4.39, the test statistic from the example, and then use the MEANS procedure to get the fraction of times that was true. The final answer should be approximately 0.068, as in the original example.
DATA final;
SET f_statistics;
IF f ne .; /* Drop rows of the GLM output that don't contain an F statistic */
reject_null = (f > 4.39);
RUN;
PROC MEANS
DATA = final
MEAN;
VAR reject_null;
RUN;
This is my approach. It employs the ALLPERM routine of SAS to generate all permutations. Then, to eliminate duplicates, I use the product of the numbers in each group of three as a key. The result is 1680 distinct groupings.
After this, you can use the BY statement of PROC GLM to run the ANOVA for each value of the group indicator group in the final data set.
proc transpose data=anova(keep=resp) out=anova1;
run;
quit;
filename resperms temp; /* scratch file to hold the generated permutations */

data anova1;
  set anova1;
  n = fact(9);
  array rs (*) col1-col9;
  file resperms notitles;
  do i = 1 to n;
    call allperm(i, of rs(*));
    /* the product of each group of three acts as an order-independent key */
    a = rs(1)*rs(2)*rs(3);
    b = rs(4)*rs(5)*rs(6);
    c = rs(7)*rs(8)*rs(9);
    put rs(*) a b c;
  end;
run;
data perms;
infile resperms;
input x1-x3 y1-y3 z1-z3 a b c;
run;
proc sort data=perms nodupkey;
by a b c;
run;
data perms;
set perms;
group = _N_;
drop a b c;
run;
proc transpose data=perms out=perms;
by group;
run;
quit;
data perms;
set perms;
var = substr(_NAME_,1,1);
obs = substr(_NAME_,2,1)*1;
rename col1=resp;
drop _NAME_;
run;
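The BY-group PROC GLM run mentioned at the start of this answer would then look something like this (a sketch, assuming the perms data set built above and mirroring the OUTSTAT approach from the other answer):

```sas
proc sort data=perms;
  by group var obs;
run;

/* one F statistic per permuted grouping */
proc glm data=perms noprint outstat=f_perm;
  class var;
  model resp = var;
  by group;
run;
quit;
```

f_perm can then be filtered and summarized the same way as f_statistics in the other answer.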
I need to calculate the p-values linked to these t-values, which reflect the probability that the means are equal to 0, using PROC MEANS for each visit. I also have only one classification level (i.e. treatment group = 1). Alpha for this test is set to 0.05.
The following code is what I am using, but I am not getting the desired p-value:
PROC MEANS DATA=<input dataset> ALPHA=0.05;
BY _VISNAME;
VAR Total_Score;
OUTPUT OUT=<output dataset>
N=
MEAN=
MEDIAN=
STD=
STDERR=
UCLM=
LCLM=
PROBT= / AUTONAME;
RUN;
Please suggest where I am going wrong.
I cannot see your data set, so I can only guess at the question. For starters, I would put your descriptive statistic keywords on the PROC MEANS statement, next to DATA=. This should at least generate some form of result:
proc means data=work.sample N MEAN MEDIAN STD STDERR UCLM LCLM PROBT T;
BY _VISNAME;
VAR Total_Score;
output out=work.means;
run;