Concatenating two datasets and getting unexpected results in SAS

I came across a rather weird situation where I concatenated two datasets and got really unexpected results. Here's the setup:
data a;
do i=1 to 100;
output;
end;
run;
data b;
do j=5 to 79;
output;
end;
run;
data c;
set a b;
if j=. then j=i;
run;
Can someone explain to me why j takes on the value "1" for the first 100 observations? It looks like j is being retained. But even weirder, if I change the conditional to:
data c;
set a b;
if i=. then i=j;
run;
Then "retain"-style behavior would imply that after the 100th observation you should see 100 carried down, but you don't: it's 5! What is going on?
Thanks,

All variables that are sourced from input datasets are "retained". You just don't normally notice it because the values are overwritten when the SET statement executes. In your example, dataset B does not contribute to the first 100 observations, so the value you put into the variable J on the first iteration of the data step is not overwritten until you start reading from dataset B.
The other wrinkle is that when you make the transition from reading dataset A to reading dataset B, the variables that are sourced from A are cleared.
Adding some PUT statements to your data step so you can watch what is happening will make it clearer.
data both;
put / 'TOP :' (_n_ i j) (=);
set a(obs=3) b(obs=3) ;
put 'MIDDLE:' (_n_ i j) (=);
if missing(j) then j=i ;
put 'BOTTOM:' (_n_ i j) (=);
run;
Results
TOP :_N_=1 i=. j=.
MIDDLE:_N_=1 i=1 j=.
BOTTOM:_N_=1 i=1 j=1
TOP :_N_=2 i=1 j=1
MIDDLE:_N_=2 i=2 j=1
BOTTOM:_N_=2 i=2 j=1
TOP :_N_=3 i=2 j=1
MIDDLE:_N_=3 i=3 j=1
BOTTOM:_N_=3 i=3 j=1
TOP :_N_=4 i=3 j=1
MIDDLE:_N_=4 i=. j=5
BOTTOM:_N_=4 i=. j=5
TOP :_N_=5 i=. j=5
MIDDLE:_N_=5 i=. j=6
BOTTOM:_N_=5 i=. j=6
TOP :_N_=6 i=. j=6
MIDDLE:_N_=6 i=. j=7
BOTTOM:_N_=6 i=. j=7
TOP :_N_=7 i=. j=7
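If the goal was for J to take the value of I on every observation that comes from A (instead of whatever happens to be sitting in the program data vector), one option is to test which data set contributed the observation rather than testing for a missing value. This is a sketch only, not part of the original answer; the IN= flag name from_a is mine:
data c;
set a(in=from_a) b;
if from_a then j=i; /* rows contributed by A get J=I; rows from B keep the J read from B */
run;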

Related

Igor Pro 8, functions for comparing strings

Hi, I'm pretty new to using Igor Pro. I'm looking for some help on writing a procedure for a task.
I have 4 waves: two are text waves and two are numeric waves (one of which has no data yet). I need to write a function that will compare the two text waves and, where they are equal, have Igor pull data from one of the numeric waves and put it at the correct point to match the text wave it's coupled with.
To lay it out conceptually:
twave1 twave2
nwave1 nwave2
twave1 is a list of all isotopes up to neptunium but they're not in order, and nwave1 is their corresponding mass values. (both on table1)
twave2 is the same list of isotopes but ordered properly (i.e. 1H, 2H, 3H, 4H...3He, 4He...etc.) and nwave2 is empty (both on table2)
so the goal is to create a function that will sort through twave1 and twave2, and if they match, pull the data from nwave1 into nwave2, so that the masses match with the correct isotopes on table2. So table2 will have the correctly ordered isotopes and now the mass data as well, in the correct places.
Any help would be greatly appreciated; this is where I've gotten so far
function assignMEf()
wave ME, ME_FRIB
wave isotope_FRIB, isotope
variable len = numpnts(ME)
variable i, j
variable ME_current, iso_current
for(i=0; i<len; i+=1)
ME_current = ME[i]
iso_current = isotope[i]
for(j=0; j<4254; j+=1)
if(iso_current == isotope_frib[j])
ME_frib = ME[i]
endif
endfor
endfor
end
If I understood correctly, the two waves you want at the end are isotope and ME. Your code was close to working; however, you need to tell Igor when you declare a text wave that it is a text wave, by using the /t flag. I also simplified the code a bit:
function assignMEf()
wave ME, ME_FRIB
wave/t isotope, isotope_FRIB
variable len = numpnts(ME)
variable i, j
for(i = 0; i < len; i += 1)
for(j = 0; j < len; j += 1)
if(stringmatch(isotope[i],isotope_frib[j]))
ME[i] = ME_FRIB[j]
endif
endfor
endfor
end
This code is not very robust but works for what you'd like to do.
To test the code, here is my MWE:
•Make/O/N=(10) ME_FRIB = p
•Make/O/N=(10) ME = NaN
•Make/O/N=(10)/T isotope_FRIB = "iso" + num2str(10 - p)
•Duplicate/O isotope_FRIB,isotope
•Sort isotope,isotope
•Edit isotope_FRIB,ME_FRIB,isotope,ME
•assignmef()
I don't think stringmatch is the right choice here. It uses wildcard matching but the OP AFAIU wants match/no-match so !cmpstr is a better choice.

Is it possible to create a string of alphabet characters (separated by "and"s) based on the number of members of a global variable in SAS?

I am looking to build a system that merges data sets based on common values using the merge function.
However I know that sometimes I will have to merge 2 data sets, and sometimes I will have to merge 20 data sets.
so the generic merge function looks as follows
DATA [data_set];
merge
[loop over data to be merged]
by [factors which I am merging by]
if [a and b]
format [sort order]
run;
The problem is that the [a and b] after the IF needs to be a string generated to match the number of tables being merged. If I want to merge 2 tables, [a and b] is great, but if I want to merge 3 tables it would have to be [a and b and c].
Is there a way for me to generate a string [a and b and.... N] that is produced based on the length of a global variable?
Hopefully my question is clear, I cannot provide the actual code I am using as it contains sensitive information. I will try to provide more information / answer questions as best I can if there is something I missed.
Using the makedata macro from @Richard:
%let MergeData = X1 X2 X3 X4 X5 X6 X7 X8 X9 X10;
DATA X_COMB;
merge
%internalmacro1(
%nrstr(
#L1#;
)
,L1 = &MergeData.
);
by id;
if ; /* create a string based on %&MergeData length, of # and # and # ... where # = a,b,c,d */
run;
but where that last IF is, I want the number of items in the &MergeData set; I think I need to make an array and convert to hex for the values maybe? There must be some equivalent hex value for 'a b c' etc.
The problem is that people are going to add and remove items from the set &MergeData, so the merge I'm trying to create needs to scale to the number of data sets being input. Sorry I can't provide more!
Rosie Method:
proc import
datafile = "FilePath\Alphabet.csv"
DBMS = csv
OUT = AlphabetConversion;
;
data desiredstatements;
set AlphabetConversion (obs=26); /*This limits the observations used*/
run;
proc sql;
select AlphabetConversion into :dynamiccode from desiredstatements separated by " and ";
quit;
%put &dynamiccode.; /*Check the log to see what you got and make sure it's the code you want */
Quentin Method:
DATA Merged_Data;
merge
mydata1 (in=_mydata1) mydata2(in=_mydata2) mydata3 (in=_mydata3);
by ID;
if _mydata1 and _mydata2 and _mydata3;
run;
This structure is fine for a merge where you get to specify all of your inputs. My problem is that I am trying to write a macro that will take mydata1 and mydata2 sometimes, and mydata1-mydata20 other times. I don't know how to make the if _mydata1 and _mydata2 .... _mydata20, when it has 20 data sets to be merged, and _mydata1 and _mydata2 when it only has two.
Perhaps you can work out some example code, based on this data:
%macro makedata;
%local i;
data %do i = 1 %to 10; x&i(keep=id x&i) %end;;
do id = 1 to 42;
array x(10);
do _n_ = 1 to dim(x);
x(_n_) = id * 100 + _n_;
end;
output;
end;
run;
%mend;
%makedata;
data want;
merge ... fill in the rest ...;
... fill in the rest ...
run;
Given a bunch of data sets like:
data mydata1 ;
input id x1 ;
cards ;
1 10
2 20
3 30
;
data mydata2 ;
input id x2 ;
cards ;
1 100
3 300
;
data mydata3 ;
input id x3 ;
cards ;
1 1000
2 2000
3 3000
;
You can merge them together and keep only the records that match between all three data sets like:
data all ;
merge
mydata1 (in=_mydata1)
mydata2 (in=_mydata2)
mydata3 (in=_mydata3)
;
by id ;
if _mydata1 and _mydata2 and _mydata3 ;
run ;
If you look at the above step, it's clear that there are two lists. A list of data sets on the merge statement, and a list of variables on the IF statement. You can use the macro language to generate that step. When you call the macro you would pass it a list of data sets to be merged. The macro would then generate the DATA step with the merge.
Here's a macro, which uses macro looping to generate the two lists:
%macro innermerge
(data= /*space-delimited list of data sets to be merged*/
,by= /*space-delimited list of BY variables for merge*/
,out= /*output data set*/
)
;
%local i data_i ;
data &out ;
merge
%do i=1 %to %sysfunc(countw(&data,%str( ))) ;
%let data_i=%scan(&data,&i,%str( )) ;
&data_i (in=_&data_i)
%end ;
;
by &by ;
if
%do i=1 %to %sysfunc(countw(&data,%str( ))) ;
%let data_i=%scan(&data,&i,%str( )) ;
%if &i>1 %then %do ;
and
%end ;
_&data_i
%end ;
;
run ;
%mend;
Use like:
%innermerge
(data=mydata1 mydata2
,by=id
,out=want
)
MPRINT(INNERMERGE): data want ;
MPRINT(INNERMERGE): merge mydata1 (in=_mydata1) mydata2 (in=_mydata2) ;
MPRINT(INNERMERGE): by id ;
MPRINT(INNERMERGE): if _mydata1 and _mydata2 ;
MPRINT(INNERMERGE): run ;
NOTE: There were 3 observations read from the data set WORK.MYDATA1.
NOTE: There were 2 observations read from the data set WORK.MYDATA2.
NOTE: The data set WORK.WANT has 2 observations and 3 variables.
%innermerge
(data=mydata1 mydata2 mydata3
,by=id
,out=want
)
MPRINT(INNERMERGE): data want ;
MPRINT(INNERMERGE): merge mydata1 (in=_mydata1) mydata2 (in=_mydata2) mydata3 (in=_mydata3) ;
MPRINT(INNERMERGE): by id ;
MPRINT(INNERMERGE): if _mydata1 and _mydata2 and _mydata3 ;
MPRINT(INNERMERGE): run ;
NOTE: There were 3 observations read from the data set WORK.MYDATA1.
NOTE: There were 2 observations read from the data set WORK.MYDATA2.
NOTE: There were 3 observations read from the data set WORK.MYDATA3.
NOTE: The data set WORK.WANT has 2 observations and 4 variables.
You're close with your attempt at the code I gave you earlier, but the one key thing (other than me giving you the wrong order for the SQL statement - sorry about that, fixed below) is that you are reading in a CSV and assuming it has headers, whereas I would prefer to define the variable in the code. And then you are naming the data set when you should be naming the variable in the PROC SQL - try this:
/*File must have one letter per row*/
%let file=\\hefce-sas\nuser\user\thomaro\SAS\Temp\Alphabet.csv;
data AlphabetConversion;
infile "&file."
delimiter=',' missover dsd;
format letter $1.;
input letter $;
run;
/*I've explicitly defined this so it runs but this needs to be dynamic - see below*/
%let numstatements = 5;
data desiredstatements;
set AlphabetConversion (obs=&numstatements.); /*This limits the observations used*/
run;
proc sql;
select letter into :dynamiccode separated by ' and ' from desiredstatements;
quit;
/*Check the log to see what you got and make sure it's the code you want */
%put &dynamiccode.;
This construct is very useful in general for dynamically creating code.
Now there is the question of dynamically defining numstatements - this should be based on the line at the top of your code:
%let MergeData = X1 X2 X3 X4 X5 X6 X7 X8 X9 X10;
So I would suggest you want a macro function that counts words - there isn't one of these, but you can turn the normal function countw into a macro function using %sysfunc() - so you would want to replace the %let above with something like this:
%let numstatements = %sysfunc(countw(&mergedata.));
%put The number of datasets to be merged is &numstatements.;
Now (if I've understood your problem correctly) when you want to run the code with a different number of datasets, all you need to do is change your %let MergeData= at the top of the code and you're good to go.
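To tie this back to the merge itself, the generated macro variable can then be dropped straight into the subsetting IF. This is just a sketch, and it rests on my assumption that the letters a, b, c, ... produced above are also used as the IN= flag names on the MERGE statement (shown here for %let numstatements = 5):
data X_COMB;
merge X1(in=a) X2(in=b) X3(in=c) X4(in=d) X5(in=e); /* one IN= flag per data set, named to match the letters */
by id;
if &dynamiccode.; /* resolves to: a and b and c and d and e */
run;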
If I've understood your question, does the following achieve what you want?:
%let ds_list = a b c d;
%let ds_and_ds = %sysfunc(tranwrd(&ds_list,%str( ),%str( and )));
%put ds_list = &ds_list;
%put ds_and_ds = &ds_and_ds;
If not, then please give an example of the data set list into which you want to insert "and" between each data set.

Deleting duplicate answers containing special characters in the survey

Hello, I am dealing with the following problem. I have a survey where respondents were able to mark several answers as well as add their own. I am trying to get unique answers so that I can count their occurrences. For example, suppose we have 3 answers: a, b, c. Person 1 marked answer a, Person 2 marked answers b and c, Person 3 marked a and c. I would like to receive the result: "a" was marked 2 times. To do that I'm trying to delete duplicate answers and create a macro variable that stores those unique answers: a, b, c.
I have already renamed all of the survey questions to v1-v&n_que., where n_que is a macro variable that holds the number of questions in the survey. I was trying to split all of the answers into tables (using the previous example, I would get a column with the following values: a, b, c, a, c). Then I wanted to sort this data and remove duplicates. I've tried the following:
%macro coll_ans(lib, tab);
%do _i_ = 1 %to &n_que. %by 1;
%global betav&_i_.;
proc sql noprint;
select distinct v&_i_. into :betav&_i_. separated by ', '
from &lib..&tab.
where v&_i_. ^= ' ';
quit;
data a&_i_.;
%do _j_ = 1 %to %sysfunc(countw(%quote(&&&betav&_i_.), ',')) %by 1;
text = %scan(%quote(&&&betav&_i_), &_j_., ',');
output;
%end;
run;
%end;
%mend coll_ans;
It's worth mentioning that if somebody picked more than one answer, for example a and b, the answers are separated with a comma; that's why I picked this separator, to unify the records.
I have tried almost everything - changing %quote to %bquote or %superq, writing && instead of &&& - and I keep getting the following error (the first of 40 others):
ERROR: The function NO is unknown, or cannot be accessed.
"NO" is one of the answers to the first question in the survey; the full answer is: NO (go to the 9th question). It's worth mentioning that the whole survey is in Polish, but I am using the right encoding, so I don't believe that should cause any problems (hopefully).
I will be grateful for any advice, because I have run into a wall I can't get past.
Guessing you have a data set like:
data have ;
input id v1 : $8. v2 : $8.;
cards ;
1 a a
2 b,c b
3 a,c c
;
You can transpose the data set to make it have one record per ID-variable-value.
data tran (keep=id VarName Value);
set have ;
array vars{*} v1 v2 ;
do i=1 to dim(vars) ;
Varname=vname(vars{i}) ;
do j=1 to countw(vars{i},',') ;
Value=scan(vars{i},j,',') ;
output ;
end ;
end ;
run ;
The output data set looks like:
id Varname Value
1 v1 a
1 v2 a
2 v1 b
2 v1 c
2 v2 b
3 v1 a
3 v1 c
3 v2 c
You can then use PROC FREQ or SQL to get the counts (an SQL version is sketched after the output below).
proc freq data=tran ;
tables varname*value/missing list ;
run ;
Outputs
Varname Value Frequency
v1 a 2
v1 b 1
v1 c 2
v2 a 1
v2 b 1
v2 c 1
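For reference, the SQL route mentioned above would look something like this (a sketch producing the same counts from the TRAN data set):
proc sql;
select varname, value, count(*) as frequency
from tran
group by varname, value
order by varname, value;
quit;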
First of all, it would be better if you posted the format in which you receive the survey data, as this would dictate the simplest/fastest approach overall.
Also, as a general rule, it's best to get the inputs & outputs right in non-macro SAS code, then use macros to optimize the process etc. It's easier to debug that way - even for someone who's been using macros for a long time... :)
That said, from your Proc SQL code, it appears that:
a. you're receiving the answers in a single delimited text field, such as "a,b,c" or "b,c" or "a,b,z"
*** example data;
data work.answers;
length answer $10.;
input answer;
datalines;
a,b,c
a
b
b,c
NO
a,b,z
n
run;
*** example valid answer entries;
data work.valid;
length valid $10.;
input valid;
datalines;
a
b
c
NO
YES
run;
b. you want to validate each answer entry and generate counts, like:
NO 1
YES 0
a 3
b 4
c 2
There are many ways to do this in SAS, but for parsing tokenized text data, a de-duplicated lookup table using the hash object is handy. The code below also prints the following to the log for debugging/verification...
answer=a,b,c num_answers=3 val=a val=b val=c validated=a,b,c
answer=a num_answers=1 val=a validated=a
answer=b num_answers=1 val=b validated=b
answer=b,c num_answers=2 val=b val=c validated=b,c
answer=NO num_answers=1 val=NO validated=NO
answer=a,b,z num_answers=3 val=a val=b val=z -invalid validated=a,b, validated=a,b,
answer=n num_answers=1 val=n -invalid validated= validated=
Once you've mastered the declaration syntax for the hash object, it's quite logical and relatively fast. And of course you can add validation rules - such as treating upper-case and lower-case entries the same (a sketch of that follows the code below)...
*** first, de-duplicate your lookup table. ;
proc sort data=work.valid nodupkey;
by valid;
run;
data _null_;
length valid $10. answer_count 4. count 4. validated $10.;
retain count 0;
*** initialize & load hash object ;
if _N_ = 1 then do;
declare hash h(multidata: 'n', ordered: 'y');
rc = h.defineKey('valid');
rc = h.defineData('valid','count');
rc = h.defineDone();
do until(eof1);
set work.valid end=eof1;
h.add();
end;
end;
*** now process questions/answers;
do until(eof);
*** read each answer;
set answers end=eof;
num_answers=countw(answer);
putlog answer= num_answers= @;
*** parse each answer entry;
validated=answer;
do i=1 to num_answers;
val=scan(answer,i);
putlog val= @;
*** (optional) keep track of total #answers: valid + invalid;
answer_count+1;
*** check answer entry in lookup table;
rc= h.find(key:val);
*** if entry NOT in lookup table, remove from validated answer;
if rc ne 0 then do;
putlog "-invalid " @;
validated=tranwrd(validated,trim(val),' ');
end;
*** if answer found, increment counter in lookup table;
else do;
count+1;
h.replace();
end;
end;
putlog validated=;
end;
*** save table of answer counts to disk;
if eof then h.output(dataset: 'work.counts');
run;
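As mentioned above, the validation rules are easy to extend. For example, to treat upper-case and lower-case entries as the same answer, you can upper-case the key both when loading the hash and when looking it up. A minimal standalone sketch of just that idea (not a rewrite of the full step above):
data _null_;
length valid $10. val $10.;
*** load the lookup table with upper-cased keys;
if _N_ = 1 then do;
declare hash h();
rc = h.defineKey('valid');
rc = h.defineData('valid');
rc = h.defineDone();
do until(eof1);
set work.valid end=eof1;
valid = upcase(valid);
rc = h.add(); /* rc absorbs any duplicates created by case-folding */
end;
end;
*** look up each answer entry in upper case as well;
set work.answers;
do i=1 to countw(answer,',');
val = upcase(scan(answer,i,','));
rc = h.find(key: val); /* rc=0 means the entry is valid */
putlog answer= val= rc=;
end;
run;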

How to save several iterations of a variable to Excel

I have something simple like the following, where I call the script for 5 iterations.
for n=i:5
(call script)
end
How can I save one variable's output to Excel? Say variable A changes for each iteration:
A=5
A=2.7
A=6
.
.
Can this be saved into Excel in one column?
Should I use:
xlswrite('output.xlsx',A,.....
With some range?
The best approach is to do something like this:
for i=1:5
% (call script)
A(i) = i; % Or to the obtained value
end
xlswrite('my_xls.xls',A);
If you want to save more values, then you could do something like this:
for i=1:5
% (call script)
A(i) = i; % Or to the obtained value
B(i) = i; % Or to the obtained value
end
M = vertcat(A(:)',B(:)');
xlswrite('my_xls.xls',M);
The xls file is created in MATLAB's current directory.
Hope this helps,

AWK reporting duplicate lines and count, program explanation

I found the following AWK program on the internet and tweaked it slightly to look at column $2:
{ a[$2,NR]=$0; c[$2]++ }
END {
for( k in a ) {
split(k,b,SUBSEP)
t=c[b[1]] # added this bit to capture count
if( b[1] in c && t>1 ) { # added && t>1 only print if count more than 1
print RS "TIMES ID" RS c[b[1]] " " b[1] RS
delete c[b[1]]
}
for(i=1;i<=NR;i++) if( a[b[1],i] ) {
if(t>1){print a[b[1],i]} # added if(t>1) only print lines if count more than 1
delete a[b[1],i]
}
}
}
Given the following file:
abc,2,3
def,3,4
ghi,2,3
jkl,5,9
mno,3,2
The output is as follows when the command is run:
Command: awk -F, -f find_duplicates.awk duplicates
Output:
TIMES ID
2 2
abc,2,3
ghi,2,3
TIMES ID
2 3
def,3,4
mno,3,2
This is fine.
I would like to understand what is happening in the AWK program.
I understand that the first line is loading each line into a multidimensional array?
So the first line of the file would be a['2','1']='abc,2,3' and so on.
However I'm a bit confused as to what c[$2]++ does, and also what the significance of split(k,b,SUBSEP) is?
Would appreciate it if someone could explain line by line what is going on in this AWK program.
Thanks.
The increment operator simply adds one to the value of the referenced variable. So c[$2]++ takes the value for c[$2] and adds one to it. If $2 is a and c["a"] was 3 before, its value will be 4 after this. So c keeps track of how many of each $2 value you have seen.
for (k in a) loops over the keys of a, in no particular order. If the value of $2 on the first line was "a", the key built from that line is the compound of "a" and the line number 1. The key built from the second line combines that line's $2 value with the line number 2, and so on.
The split(k,b,SUBSEP) will create a new array b from the compound value in k, i.e. basically reconstruct the parts of the compound key that went into a. The value in b[1] will now be the value which was in $2 when the corresponding value in a was created, and the value in b[2] will be the corresponding line number.
The final loop is somewhat inefficient; it loops over all possible line numbers, then skips immediately to the next one if an entry for that ID and line number did not exist. Because this runs inside the outer loop for (k in a) it will be repeated a large number of times if you have a large number of inputs (it will loop over all input line numbers for each input line). It would be more efficient, at the expense of some additional memory, to just build a final output incrementally, then print it all after you have looped over all of a, by which time you have processed all input lines anyway. Perhaps something like this:
END {
for (k in a) {
split (k,b,SUBSEP)
if (c[b[1]] > 1) {
if (! o[b[1]]) o[b[1]] = c[b[1]] " " b[1] RS
o[b[1]] = o[b[1]] RS a[k]
}
delete a[k]
}
for (q in o) print o[q] RS
}
Update: Removed the premature deletion of c[b[1]].
