How to implement the Jaccard macro in a data step for a matching problem?

I tried implementing a solution in my SAS code, but with no luck. I'm trying to add a Jaccard distance column to my dataset.
I keep getting these errors:
variable name & is not valid
invalid value for the keep option
The idea is to solve a matching problem between two datasets while taking typing errors into account.
data table_test;
input nom1 $3. nom2 $3.;
cards;
abcade
vdenfr
azfefs
;
run;
%macro kshingling
(string
,k=5
,out=&sysmacroname.
)
;
data &out.;
string = strip(prxchange('s#\s# #',-1,symget('string')));
do _n_ = 1 to lengthn(string)-&k.+1;
ngram = substr(string,_n_,&k.);
output;
end;
run;
%mend;
%macro jaccard
(string1
,string2
)
;
%kshingling(&string1.,k=2,out=s1)
%kshingling(&string2.,k=2,out=s2)
proc append base=s1 data=s2; run;
proc freq data=s1 noprint;
tables string*ngram / out=s2;
run;
proc transpose data=s2 out=s1(drop=_name_ _label_);
by string notsorted;
var count;
id ngram;
run;
proc stdize data=s1 out=s2 missing=0 reponly;
var _numeric_;
run;
proc distance data=s2 method=jaccard absent=0 out=s1;
var anominal(_numeric_);
id string;
run;
data t(keep=&string1.);
set s1(firstobs=2);
run;
data _null_;
set t;
call symput('Jaccard',&string1.);
%put Distance de Jaccard = &Jaccard;
run;
%mend;
data test;
set table_test;
call symput('n1',nom1);
call symput('n2',nom2);
%jaccard(&n1,&n2);
run;
data Jacc;
Dist_Jacc=&Jaccard;
run;
data Final; merge table_test Jacc; run;

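You are mixing DATA step and macro code in ways that are incorrect.
CALL SYMPUT executes at run time, but the direct macro call %jaccard is processed while the DATA step is still being compiled, which happens before run time. When %jaccard(&n1,&n2) is expanded, the macro variables n1 and n2 have therefore not been set yet, so the macro most likely receives the literal text &n1, which would explain the "variable name & is not valid" error on keep=&string1.
For instance, this cannot work: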
data test;
set table_test;
call symput('n1',nom1);
call symput('n2',nom2);
%jaccard(&n1,&n2);
run;
Running jaccard for each record in table_test should be accomplished using something like the following DATA step, which builds the source code for each macro call and then tells the session to execute it.
data _null_;
set table_test;
macro_call = '%nrstr(%jaccard)' || cats('(' , nom1, ',', nom2, ')');
call execute (macro_call);
run;
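A minimal sketch of the timing difference, using a hypothetical one-line macro %show and a hypothetical dataset pairs (neither is part of the question) so that the order of events is visible in the log. The %nrstr wrapper keeps the macro from executing while CALL EXECUTE is pushing the text, so each call should run only after the DATA _NULL_ step has finished.

```sas
/* Hypothetical macro, used only to make the call order visible in the log */
%macro show(a, b);
  %put NOTE: show called with a=&a b=&b;
%mend show;

/* Hypothetical two-row input, same layout as table_test */
data pairs;
  input nom1 $ nom2 $;
  cards;
abc ade
vde nfr
;
run;

data _null_;
  set pairs;
  /* Single quotes keep the macro processor away from the text at compile time; */
  /* %nrstr() delays the macro call until the generated code runs after this step */
  call execute(cats('%nrstr(%show)(', nom1, ',', nom2, ')'));
run;
```

The log should show one NOTE per observation, written after the DATA _NULL_ step completes, with the actual values of nom1 and nom2.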

It looks to me like the output of your macro is the dataset T. You can use PROC APPEND to aggregate the results of multiple macro calls into a single dataset. You can then combine that data with your input dataset.
data _null_;
set table_test;
call execute(cats('%nrstr(%jaccard)(',nom1,',',nom2,');'));
call execute('proc append base=result data=t; run;');
run;
data want;
set table_test;
set result;
run;
BUT you will need to make sure the generated T dataset has THE EXACT SAME STRUCTURE each time.
So change the ending steps of the macro to this single step so that the dataset T always consists of ONE observation and ONE variable and the variable is named Jaccard. You can also use the %GLOBAL statement to make sure that the value of JACCARD macro variable is available after the macro finishes.
%if not %symexist(jaccard) %then %global jaccard;
data t ;
set s1(keep=&string1. rename=(&string1.=Jaccard) obs=2 firstobs=2);
call symputx('Jaccard',Jaccard);
run;
%put Distance de Jaccard = &Jaccard;
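If the macro machinery turns out to be more than the problem needs, the distance can also be computed directly inside one DATA step. The sketch below is a swapped-in technique, not the PROC DISTANCE approach used in the macro, and its results are not guaranteed to match that procedure: it treats each name as a set of distinct 2-character shingles and computes 1 minus intersection size over union size. The dataset name jacc_direct and the helper variables are hypothetical, and the sketch assumes the names contain no embedded blanks.

```sas
/* Same input as in the question */
data table_test;
  input nom1 $3. nom2 $3.;
  cards;
abcade
vdenfr
azfefs
;
run;

data jacc_direct(drop=i g inter n1 n2);
  set table_test;
  length g $2;
  inter = 0; n1 = 0; n2 = 0;
  /* Distinct bigrams of nom1, and how many of them also occur in nom2 */
  do i = 1 to lengthn(nom1) - 1;
    g = substr(nom1, i, 2);
    if find(nom1, g) = i then do;   /* count a bigram at its first position only */
      n1 = n1 + 1;
      if find(nom2, g) > 0 then inter = inter + 1;
    end;
  end;
  /* Distinct bigrams of nom2 */
  do i = 1 to lengthn(nom2) - 1;
    g = substr(nom2, i, 2);
    if find(nom2, g) = i then n2 = n2 + 1;
  end;
  /* Jaccard distance = 1 - (size of intersection) / (size of union) */
  if n1 + n2 - inter > 0 then dist_jacc = 1 - inter / (n1 + n2 - inter);
run;
```

Rows where neither name has at least two characters keep a missing dist_jacc, because the union is empty.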

Related

Standardized tables with repeated labels using PROC TABULATE in SAS

In SAS, I need a PROC TABULATE where labels are repeated so that it's easier to find them in Excel using INDEX-MATCH. Here is an example with sashelp.cars.
The first PROC TABULATE has the advantage of having repeated labels, which is needed for the INDEX-MATCH. Its flaw is that SAS only outputs the non-missing combinations.
data cars;
set sashelp.cars;
run;
proc sort data=cars;
by make;
run;
This doesn't give all labels. I would like a table with 3 continents by column (Europe, Asia, USA) and every car type (Sedan, SUV, Wagon, Sports...).
PROC TABULATE DATA = cars;
option missing=0;
by make;
CLASS make type Type Origin / mlf MISSING ;
TABLE (
(type*make)
), (Origin='') / printmiss nocellmerge ; RUN;
So, in order to have all 3 continents by column and every type of car (Sedan, SUV, Wagon, Sports...), I use CLASSDATA, as suggested:
Data level;
set cars;
keep make type Type Origin;
Run;
PROC TABULATE DATA = cars MISSING classdata=level;
option missing=0;
by make;
CLASS make type Type Origin / mlf MISSING ;
TABLE (
(make*type)
), (Origin='') / printmiss nocellmerge ;
RUN;
But this gives a humongous table and non-repeating labels. Is there a midway solution with:
all the columns (3 continents) like in the last table
only the concerned MAKEs, that is the first 6 rows for Acura
repeated labels like in the first PROC TABULATE
Thank you very much,
I advise against exporting the listing of PROC TABULATE to Excel
PROC TABULATE does not repeat values in the first column for each value in the second, because the output is meant for human reading. This is not the tool you need to write data to Excel for further lookup.
I advise using SUMIFS rather than MATCH
MATCH is a great function in Excel, but it is not a good choice for your application, because
it gives an error when it does not find what you look for, which is why you need all labels in your output
it only supports one criterion, so you need at least 3 of them
it returns a position, so you still need an INDEX function.
Therefore, I advise writing a simple CREATE TABLE
proc sql;
create table TO_EXPORT as
select REGION, MACTIV, DATE, count(*) as cnt
from data
group by REGION, MACTIV, DATE;
quit;
proc export data = TO_EXPORT outfile="&myFolder\&myWorkbook..xlsx" dbms=xlsx replace;
run;
You will then have your data in Excel in a more data-oriented format, with the labels repeated on every row.
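Applied to the sashelp.cars data from the question, the same idea might look like the sketch below. The table name to_export follows the answer; choosing make, type and origin as the grouping columns is an assumption about what the Excel lookup needs.

```sas
proc sql;
  /* One row per existing make/type/origin combination, labels repeated on every row */
  create table to_export as
  select make, type, origin, count(*) as cnt
  from sashelp.cars
  group by make, type, origin;
quit;
```

Combinations that do not occur in the data simply have no row, which is acceptable when the lookup formula returns zero for a missing combination.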
To retrieve the data, I advise the following type of Excel formula
=sumifs($D:$D,$A:$A,"13-*",$B:$B,$C:$C,"apr2020")
It adds up every count whose row matches the criteria in the columns to its left.
Because at most one row will meet these criteria, it actually just looks up a count you are looking for.
If that count does not exist, it will just return zero.
Disclaimer:
I did not test this code, so if it does not work, leave a comment and I will.

Generate SAS code using input specifications

We have a set of predefined macros, developed in SAS, which are used for generating tables, listings and figures from SAS datasets. My requirement is specifically to automate SAS code generation for tables. A fixed number of templates is available for the tables to be generated, and a SAS program is available to generate each table output. Whenever a table from these templates needs to be generated, the related SAS program has to be modified to produce the required output.
To avoid the redundancy of writing a separate SAS program each time to generate the same kind of output, I want to create a tool that generates the SAS code. This code will serve as the source for generating tables.
I have prepared an Excel workbook with all the details required to construct a program (the various parameters and their values, e.g. title, footnote, source dataset, group by row/column). A simple user form created in Excel is used to display and configure these parameters.
The question now is: how do I generate SAS statements using Excel VBA, and how do I connect to SAS from Excel VBA to execute SAS programs?
If you have done anything like this before, or if you have any ideas for this type of problem, could you please share them here? Any help is really appreciated.
Thank you.
I wouldn't involve VBA. I would write a SAS program to read the requirements, generate the tables and push them back to a new Excel workbook. If you want to drive this entirely from Excel, I would recommend the SAS Add-In for Excel instead of VBA code, though you could trigger it via VBA. Chris Hemedinger is the SAS expert on these types of issues, and he posts on communities.sas.com, where, as another user suggested in the comments, you should post this question.
It's a trivial process to create empty tables, but reporting tables often have a very specific layout, so you may be able to generalize this process further. For example, I have a macro that creates my standard Table 1 (table of characteristics) for a data set, where I just need to specify the input variables by type (continuous, categorical, binary), and the output is generated and can be pushed to Excel via ODS EXCEL. I also use PROC REPORT to format it, because I shade the rows alternately for different variables so that it's easier to read.
/*
This macro creates a table of characteristics for the variables listed.
It handles categorical, binary and continuous variables and produces an output dataset that can be further customized. No statistical information is
included in this analysis
*/
/*Parameters to be set:
dsetin - name of dataset to be analyzed
cont = macro variable list of variable names, ie cont=age weight height
cat=list of categorical variables ie cat=sex grade
bin=binary variables, such as smoking now, smoking ever
dsetout=name of output dataset
Run example at the end for a sample output dataset called sample_table_char in the work directory
*/
*options mprint symbolgen;
%macro table_char(dsetin, cont, cat, bin, dsetout);
*delete old dataset;
proc datasets nodetails nolist;
delete &dsetout;
quit;
/****************************************************************
Handle Categorical Variables
****************************************************************/
*loop through variable list;
%let i=1;
%do %while (%scan(&cat, &i, " ") ^=%str());
%let var=%scan(&cat, &i, " ");
*Get format for variable;
data _null_;
set &dsetin;
call symput("var_fmt", vformat(&var));
run;
proc freq data=&dsetin noprint;
table &var/missing out=tab_var;
run;
data temp1;
length categorical $200.; format categorical $200.;
length value $200.; format value $200.;
set tab_var;
percent=percent/100;
categorical=put(&var., &var_fmt.);
if _n_=1 then do;
value=put(count, 8.)||"("||compress(put(percent, percent8.1))||")";
order=2;
output;
order=1;
value='';
categorical=propcase(vlabel(&var.));
output;
end;
else do;
order=2;
value=put(count, 8.)||"("||compress(put(percent, percent8.1))||")";
output;
end;
keep categorical value order;
run;
proc sort data=temp1 out=temp2 (drop=order); by order categorical; run;
proc append base=&dsetout data=temp2;
run;
*clean up;
proc datasets nodetails nolist;
delete tab_var temp1 temp2;
run; quit;
*Increment counter;
%let i=%eval(&i+1);
%end; *Categorical;
/****************************************************************
Handle Continuous Variables
****************************************************************/
%let i=1;
%do %while (%scan(&cont, &i, " ") ^=%str());
%let var=%scan(&cont, &i, " ");
proc means data=&dsetin (rename=(&var=vn)) noprint;
var vn;
output out=table_var n= nmiss= mean= min= max= std= median= p25= p75= p90=/autoname;
run;
*get label of variable for clean reporting;
data _null_;
set &dsetin;
call symput("var_label", vlabel(&var));
run;
data temp1;
length categorical $200.; format categorical $200.;
format value $200.; length value $200.;
set table_var;
categorical="&var_label.";
value='';
output;
categorical='Count(Missing)';
value=put(vn_n, 5.)||"("||compress(put(vn_nmiss, 5.))||")";
output;
categorical='Mean (SD)';
value=put(vn_mean, 8.1)||"("||compress(put(vn_stddev, 8.1))||")";
output;
categorical='Median (IQR)';
value=put(vn_median, 8.1)||"("||compress(put(vn_p25, 8.1))||" - "||compress(put(vn_p75, 8.1))||")";
output;
categorical='Range';
value=put(vn_min, 8.1)||" - "||compress(put(vn_max, 8.1));
output;
categorical='90th Percentile';
value=put(vn_p90, 8.1);
output;
keep categorical value;
run;
proc append base=&dsetout data=temp1;
run;
*clean up;
proc datasets nodetails nolist;
delete table_var temp1;
run; quit;
*Increment counter;
%let i=%eval(&i+1);
%end; *Continuous;
/*****************************************************************
Handle Binary Variables (only report 1s)
*****************************************************************/
%let i=1;
%do %while (%scan(&bin, &i, " ") ^=%str());
%let var=%scan(&bin, &i, " ");
proc freq data=&dsetin noprint;
table &var/missing out=tab_var;
run;
data tab_var;
set tab_var;
where &var=1;
run;
data temp1;
length categorical $200.; format categorical $200.;
length value $200.; format value $200.;
set tab_var;
percent=percent/100;
if _n_=1 then do;
value=put(count, 8.)||"("||compress(put(percent, percent8.1))||")";
order=1;
categorical=propcase(vlabel(&var.));
output;
end;
keep categorical value;
run;
proc append base=&dsetout data=temp1;
run;
*clean up;
proc datasets nodetails nolist;
delete tab_var temp1;
run; quit;
*Increment counter;
%let i=%eval(&i+1);
%end;*Binary;
%mend table_char;
/* *Example of macro usage; */
/* data sample; */
/* set sashelp.class; */
/* female=ifn( sex='F',1,0); */
/* run; */
/* */
/* */
/* %table_char(sample, height weight age, sex, female, sample_table_char); */
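To connect this back to the question, the parameters collected in the workbook could drive the macro without any VBA. The sketch below assumes the %table_char macro above has already been compiled. The specs dataset and its column layout are hypothetical; here it is typed in with DATALINES so the example is self-contained, whereas in practice it might be read from the workbook, for example with PROC IMPORT.

```sas
/* Hypothetical specification table: one row per table to generate */
data specs;
  infile datalines dsd truncover;
  length dsetin $41 cont cat bin $200 dsetout $32;
  input dsetin cont cat bin dsetout;
  datalines;
sashelp.class,height weight,sex,,class_char
sashelp.cars,msrp,origin type,,cars_char
;
run;

data _null_;
  set specs;
  /* Build one %table_char call per row; %nrstr() delays execution until after this step */
  call execute(cats('%nrstr(%table_char)(', dsetin, ',', cont, ',', cat, ',', bin, ',', dsetout, ')'));
run;
```

An empty cell, such as bin in both rows here, is passed as an empty parameter, so the corresponding loop in the macro is skipped.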

SAS adding a prefix to all words in a macro variable

I am adding a prefix to each word in a macro variable. However, when using my current method, the first word does not receive the prefix. Looking at my code, there is good reason for this as there is no space in front of the word.
The code I use is:
%LET independent_vars = FF_1 FF_4 FF_7 FF_10;
%LET log_independent_vars = %SYSFUNC(TRANWRD(&independent_vars.,%str( ),%str( ln_)));
%PUT &log_independent_vars.;
Current output is:
FF_1 ln_FF_4 ln_FF_7 ln_FF_10
Expected output is:
ln_FF_1 ln_FF_4 ln_FF_7 ln_FF_10
I've tried using prxchange, but I don't understand it.
Only the first space is stripped
You can circumvent this problem by adding one 'ln_' in front of your formula:
%LET independent_vars = FF_1 FF_4 FF_7 FF_10;
%LET log_independent_vars = ln_%SYSFUNC(TRANWRD(&independent_vars.,%str( ),%str( ln_)));
%PUT &log_independent_vars.;
ln_FF_1 ln_FF_4 ln_FF_7 ln_FF_10
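Since the question mentions prxchange, here is a minimal sketch of that route as an alternative to TRANWRD. It assumes every word in the list consists only of letters, digits and underscores, which holds for valid SAS variable names.

```sas
%let independent_vars = FF_1 FF_4 FF_7 FF_10;
/* (\w+) captures each run of word characters, ln_$1 writes it back with the prefix; */
/* -1 tells PRXCHANGE to replace every match, not just the first */
%let log_independent_vars = %sysfunc(prxchange(s/(\w+)/ln_$1/, -1, &independent_vars.));
%put &log_independent_vars.;
```

Because the pattern matches the words themselves rather than the spaces between them, the first word is treated like all the others.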

Sample 33078: How to find a specific value in any variable in any SAS® data set in a library

I need help modifying this code from SAS (http://support.sas.com/kb/33/078.html) to be:
Not case sensitive (therefore not overlooking SMITH versus Smith versus smith, I tried "upcase" but it won't work)
Include a counter (so that I can control for either knowing the first time a value appears and if needed, how many times the value appears)
Allow for a partial search (this code only allows for exact match to be searched which means I am missing many possible variables that the value could be defined under)
Thanks! :)
From your comment:
data _null_;
set &librf..&&ds&i;
%do j=1 %to &numvars;
if INDEX(upcase(&&var&j),"&string") >0 then
/*modified this part to satisfy the first and third things that I wanted*/
put "String &string found in dataset &librf..&&ds&i for variable &&var&j"
;
%end;
run;
So just add code to increment the counter. Do you want to count observations or occurrences? That is, if the same observation has multiple hits, does it count as one or multiple?
Counting each hit is easier:
data _null_;
set &librf..&&ds&i;
%do j=1 %to &numvars;
if INDEX(upcase(&&var&j),"&string") >0 then do;
_count+1;
put "String &string found in dataset &librf..&&ds&i for variable &&var&j" _count=;
end;
%end;
run;
Here is how you might count each observation.
data _null_;
set &librf..&&ds&i;
%do j=1 %to &numvars;
if INDEX(upcase(&&var&j),"&string") >0 then do;
_hit=1;
put "String &string found in dataset &librf..&&ds&i for variable &&var&j";
end;
%end;
if _hit then do;
_count+1;
put "Number of observations so far=" _count ;
end;
run;
Assuming you are running the code in the sample, I would change the comparison expression.
I would make it a macro parameter, so that you can use FIND, FINDW, FINDC, a regular expression, etc.
exp=%str(find(&&var&j,'-target-','IT')),
Then use %unquote(&exp) in place of the hard-coded comparison (the part underlined in red).
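A small sketch of what the FIND modifiers do, run against sashelp.class rather than the library scan from the sample. The search text 'AL' and the variable NAME are placeholders chosen for illustration.

```sas
data _null_;
  set sashelp.class;
  /* 'i' ignores case and 't' trims trailing blanks from both arguments, */
  /* so this is a case-insensitive partial match */
  if find(name, 'AL', 'it') > 0 then do;
    _count + 1;   /* sum statement: retained across observations */
    put 'Partial match in NAME: ' name= _count=;
  end;
run;
```

With the 'i' modifier the search string no longer has to be uppercased first, which removes the need for the upcase() call on the variable.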

Numeric Operand Is Required?

Does anyone know how I can correct the following?
CODE:
%macro variables(list);
data tire.Import2(drop=i count);
set tire.Import;
by Away_Team;
%let n=%sysfunc(countw(&list));
%DO k=1 %TO &n;
%let val = %scan (&list,&k);
array x(*) &val.lag1-&val.lag6;
&val.lag1=lag1(&val);
&val.lag2=lag2(&val)+lag1(&val);
&val.lag3=lag3(&val)+lag2(&val)+lag1(&val);
&val.lag4=lag4(&val)+lag3(&val)+lag2(&val)+lag1(&val);
&val.lag5=lag5(&val)+lag4(&val)+lag3(&val)+lag2(&val)+lag1(&val);
&val.lag6=lag6(&val)+lag5(&val)+lag4(&val)+lag3(&val)+lag2(&val)+lag1(&val);
%if %str(first.Away_Team) %then count=1;
%do i=count %to dim(x);
x(i)=.;
%end;
count + 1;
run;
%end;
%mend;
%variables(FTHG FTHGC);
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was: %str(first.Away_Team)
ERROR: The macro VARIABLES will stop executing.
I tried using %bquote and %str but no luck!
Your macro is confusing macro %IF statements vs data step IF statements, and %DO loops vs DO loops. The macro language (%IF %DO etc) is used to generate SAS code. It does not know about SAS datasets or the values of dataset variables. It's just a text processing language. The SAS data step language (IF DO etc) is used to read and process data.
When you write:
%if %str(first.Away_Team) %then count=1;
this is a macro %IF statement. The macro language does not know about dataset variables such as first.Away_Team. So the macro %if statement is testing whether the expression %str(first.Away_Team) is true or not. This is just a text string to the macro language; it does not know that first.Away_Team is a data step variable that has a value of 1 or 0. So it throws an error.
This should be just a regular IF statement:
if first.Away_Team then count=1;
The data step IF statement can test the value of first.Away_Team.
The same applies to your %DO loop at the end. You cannot write %do i=count %to dim(x); because the macro language %DO statement does not know that COUNT is a dataset variable with a value, or that dim(x) is the number of elements in an array named x. To the macro language, they are both text strings. You can write do i=count to dim(x); instead.
I would suggest you start by writing your data step with no macros or macro variables at all, and get that working the way you want for FTHG and FTHGC. Then, once you know what the working SAS code is, you can try to write a macro that will generate that code.
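A minimal sketch of the division of labour described above, using sashelp.class instead of the tire data. The macro name add_lags and the output names are hypothetical, and the step only creates a single lag per variable, so it illustrates the pattern rather than reproducing the intended calculation.

```sas
proc sort data=sashelp.class out=class;
  by sex;
run;

%macro add_lags(list);
  %local k val;
  data class_lags;
    set class;
    by sex;
    /* Macro %DO: runs once while the step is being built and */
    /* writes one assignment statement per name in the list */
    %do k = 1 %to %sysfunc(countw(&list));
      %let val = %scan(&list, &k);
      &val._lag1 = lag1(&val);
    %end;
    /* Data step IF: evaluated for every observation when the step runs */
    if first.sex then count = 0;
    count + 1;
  run;
%mend add_lags;

%add_lags(height weight)
```

Running it with options mprint shows the generated statements, which makes it easy to compare them with the hand-written data step you started from.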
