SAS simplify the contents of a variable - string

In SAS, I have a variable V containing the following value:
V=1996199619961996200120012001
I'd like to create these 2 variables:
V1=19962001 (= different modalities)
V2=43 (= the first modality appears 4 times and the second one appears 3 times)
Any ideas?
Thanks for your help.
Luc

For your first question (if I understand the pattern correctly), you could extract the first four characters and the last four characters:
a = substr(variable, 1, 4);
b = substrn(variable, max(1, length(variable)-3), 4);
You could then concatenate the two:
c = cats(a, b);
For the second, the COUNT function can be used to count occurrences of a string within a string:
http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p02vuhb5ijuirbn1p7azkyianjd8.htm
Hope this helps :)

* Make it a bit more general;
%let modeLength = 4;
%let maxOccur = 100; ** in the input **;
%let maxModes = 10; ** in the output **;
* Where does a certain occurrence start?;
%macro occurStart(occurNo);
&modeLength.*&occurNo.-%eval(&modeLength.-1)
%mend;
* Read the input;
data simplified ;
infile datalines truncover;
input v $%eval(&modeLength.*&maxOccur.).;
* Declare output and work variables;
format what $&modeLength..
v1 $%eval(&modeLength.*&maxModes.).
v2 $&maxModes..;
array w {&maxModes.} $&modeLength.; ** what (character, so the modes are stored as text) **;
array c {&maxModes.}; ** count **;
* Discover unique modes and count them;
countW = 0;
do vNo = 1 to length(v)/&modeLength.;
what = substr(v, %occurStart(vNo), &modeLength.);
do wNo = 1 to countW;
if what eq w(wNo) then do;
c(wNo) = c(wNo) + 1;
goto foundIt;
end;
end;
countW = countW + 1;
w(countW) = what;
c(countW) = 1;
foundIt:
end;
* Report results in v1 and v2;
do wNo = 1 to countW;
substr(v1, %occurStart(wNo), &modeLength.) = w(wNo);
substr(v2, wNo, 1) = put(c(wNo),1.);
put _N_= v1= v2=;
end;
keep v1 v2;
* The data I tested with;
datalines;
1996199619961996200120012001
197019801990
20011996199619961996200120012001
;
run;
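For readers who want to check the expected values outside SAS, here is a minimal sketch of the same idea in Python: cut the value into fixed-width chunks, keep the distinct chunks in first-seen order, and count each one. The function name `simplify` and the `mode_length` parameter are illustrative assumptions, not part of the SAS code above. Like the SAS version, it writes each count as digits, so a count of 10 or more would make `v2` ambiguous.

```python
def simplify(v: str, mode_length: int = 4) -> tuple[str, str]:
    """Return (distinct modes in first-seen order, their counts) as two strings.

    Hypothetical helper for illustration; mode_length is the assumed fixed
    width of each modality (4 characters in the question).
    """
    counts: dict[str, int] = {}  # dicts preserve insertion order (Python 3.7+)
    # Step through the value one fixed-width chunk at a time,
    # ignoring any incomplete chunk at the end.
    for start in range(0, len(v) - mode_length + 1, mode_length):
        mode = v[start:start + mode_length]
        counts[mode] = counts.get(mode, 0) + 1
    v1 = "".join(counts)                           # distinct modes, concatenated
    v2 = "".join(str(n) for n in counts.values())  # one count per mode
    return v1, v2


print(simplify("1996199619961996200120012001"))  # ('19962001', '43')
```

For the value in the question this gives counts of 4 and 3, since 2001 occurs three times in it.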

Related

SAS - put empty space on string

I have a script to write a SAS program (txt) that looks like this:
/********* Import excel spreadsheet with model specs *****************/
proc import file = "&mydir\sample.xls" out = model dbms = xls replace;
run;
/********* Create program model *****************/
data model;
set model;
dlb = resolve(dlb);
dub = resolve(dub);
run;
data model;
set model;
where2 = tranwrd(where,"="," ");
where2 = tranwrd(where2,"<"," ");
where2 = tranwrd(where2,">"," ");
nword = countw(where2);
bounds = trim(dlb)!!" "!!trim(dub);
bounds = tranwrd(bounds,"="," ");
bounds = tranwrd(bounds,"<"," ");
bounds = tranwrd(bounds,">"," ");
nbounds = countw(bounds);
run;
proc sql noprint;
select max(nword) into: max_word from model ;
select max(nbounds) into: max_aux from model ;
select name into: list_var separated by " " from dictionary.columns where libname = "WORK" and memname = "IMP" ;
quit;
/******* Generate Model ********/
%macro generate_model;
data model;
set model;
attrib wherev length = $500.;
do i = 1 to countw(where2);
%do j = 1 %to %sysfunc(countw(&list_var));
if upcase(scan(where2,i)) = "%upcase(%scan(&list_var,&j))" and scan(where2,i) not in ("0","1","2","3","4","5","6","7","8","9") then do;
if missing(wherev) then wherev = trim(scan(where2,i));
else if index(wherev,trim(scan(where2,i))) = 0 then do;
wherev = trim(wherev)!!" "!!trim(scan(where2,i));
end;
end;
%end;
end;
drop i where2;
run;
data model;
set model;
attrib aux length = $500.;
do i = 1 to countw(bounds);
%do j = 1 %to %sysfunc(countw(&list_var));
if upcase(scan(bounds,i)) = "%upcase(%scan(&list_var,&j))" and scan(bounds,i) not in ("0","1","2","3","4","5","6","7","8","9") then do;
if missing(aux) then aux = trim(scan(bounds,i));
else if index(aux,trim(scan(bounds,i))) = 0 then do;
aux = trim(aux)!!" "!!trim(scan(bounds,i));
end;
end;
%end;
end;
drop i bounds;
run;
%mend;
%generate_model;
data outem.bound;
set outem.model;
attrib txt length = $2000.;
txt = "******************Macros for variable"!!trim(dep)!!"******;";
output;
txt = "%"!!"macro bound"!!trim(dep)!!";";
output;
if not missing(lb) then do;
txt ="LB="!!trim(lb)!!";";
output;
end;
if not missing(ub) then do;
txt ="UB="!!trim(ub)!!";";
output;
end;
if not missing(dlb) and not missing(lb) then do;
txt ="LB=MAX(LB,"!!trim(dlb)!!");";
output;
end;
if not missing(dlb) and missing(lb) then do;
txt ="LB="!!trim(dlb)!!";";
output;
end;
if not missing(dub) and not missing(ub) then do;
txt ="UB=MIN(UB,"!!trim(dub)!!");";
output;
end;
if not missing(dub) and missing(ub) then do;
txt ="UB="!!trim(dub)!!";";
output;
end;
txt = "%"!!"mend;";
output;run;
data outem.imp;
set outem.bound;
file "&mydir\3_generate_models\3_model.sas" lrecl = 2000;
put txt;
run;
The program works fine; however, I can't manage to put empty space before UB or LB.
The output looks like this:
%macro boundHC0340;
LB= 1;
UB= 9;
%mend;
But I would like to get this:
%macro boundHC0340;
    LB= 1;
    UB= 9;
%mend;
The code already has some attempts to put empty space before UB and LB, but so far I couldn't manage.
I can put other characters and strings in there. I just can't put empty space before UB and LB in order to produce indented code.
I've tried something like this:
txt ="    LB="!!trim(lb)!!";";
But the empty space before LB does nothing.
However, if I write this:
txt ="******LB="!!trim(lb)!!";";
I get the asterisks on my program.
Any idea of what I'm missing here?
Thank you very much for your support.
Best regards
Ps: here's the hyperlink to sample xls file: sample.xls
Assuming that you have built the variable TXT with the value you want to see you just need to add a format to your final step. To avoid writing a lot of useless trailing blanks use the $VARYING format. You will need to calculate the length of your string to use that format.
data outem.imp;
set outem.bound;
file "&mydir\3_generate_models\3_model.sas" lrecl = 2000;
length= lengthn(txt);
put txt $varying2000. length;
run;
But it is probably easier to skip all of the concatenation and use the power of the PUT statement itself to write the program directly from your data. Then you can use things like column pointer controls (@3), named output (lb=) and other features of the PUT statement to format your program file.
data _null_;
set outem.model;
file "&mydir\3_generate_models\3_model.sas" ;
put 72*'*' ';'
/ '* Macros for variable ' dep ';'
/ 72*'*' ';'
/ '%macro bound' dep ';'
;
if not missing(lb) then put @3 lb= ';' ;
if not missing(ub) then put @3 ub= ';' ;
if not missing(dlb) and not missing(lb) then put
@3 'LB=MAX(LB,' dlb ');'
;
if not missing(dlb) and missing(lb) then put
@3 'LB=' dlb ';'
;
if not missing(dub) and not missing(ub) then put
@3 'UB=MIN(UB,' dub ');'
;
if not missing(dub) and missing(ub) then put
@3 'UB=' dub ';'
;
put '%mend bound' dep ';';
run;
Although, looking at the logic of those IF statements, why not reduce them to:
put @3 'LB=MAX(' lb ',' dlb ');' ;
put @3 'UB=MIN(' ub ',' dub ');' ;
I think this is the result of SAS applying left alignment by default for the $w. format of your variable when you use your put statement. You can override this by applying a format in the put statement and specifying what alignment you want to use:
data _null_;
file "%sysfunc(pathname(work))\example.txt";
a = " text here";
/*Approach 1 - default behaviour*/
/*No leading spaces on this line in output file (default)*/
put a;
/*Approach 2 - $varying + right alignment*/
/*We need to right align text while preserving the number of leading spaces, so use $varying. */
/*If every line is the same length, we can use $w. instead*/
/*Use -r to override the default format alignment*/
varlen = length(a);
put a $varying2000.-r varlen;
/*Approach 3 - manually specify indentation*/
/*Alternatively - ditch the leading spaces and tell SAS which column to start at*/
put @4 a;
run;
Try changing the last part of your code so it looks a bit like this (fix paths and dataset names as appropriate):
data bound;
set model;
attrib txt length = $2000.;
txt = "******************Macros for variable"!!trim(dep)!!"******;";
output;
txt = "%"!!"macro bound"!!trim(dep)!!";";
output;
if not missing(lb) then do;
/* LEADING SPACES ADDED HERE */
/* LEADING SPACES ADDED HERE */
/* LEADING SPACES ADDED HERE */
txt ="    LB="!!trim(lb)!!";";
output;
end;
if not missing(ub) then do;
/* LEADING SPACES ADDED HERE */
/* LEADING SPACES ADDED HERE */
/* LEADING SPACES ADDED HERE */
txt ="    UB="!!trim(ub)!!";";
output;
end;
if not missing(dlb) and not missing(lb) then do;
txt ="LB=MAX(LB,"!!trim(dlb)!!");";
output;
end;
if not missing(dlb) and missing(lb) then do;
txt ="LB="!!trim(dlb)!!";";
output;
end;
if not missing(dub) and not missing(ub) then do;
txt ="UB=MIN(UB,"!!trim(dub)!!");";
output;
end;
if not missing(dub) and missing(ub) then do;
txt ="UB="!!trim(dub)!!";";
output;
end;
txt = "%"!!"mend;";
output;
run;
data _null_;
set bound;
file "%sysfunc(pathname(work))\example.sas" lrecl = 2000;
varlen = length(txt);
put txt $varying2000.-r varlen;
run;
x "notepad ""%sysfunc(pathname(work))\example.sas""";
Contents of example.sas (based on sample xls):
******************Macros for variableHC0340******;
%macro boundHC0340;
    LB= 1;
    UB= 9;
%mend;

SAS Index on Array

I am trying to search for a keyword in a description field (descr) and, if it is there, flag that record as a match (which keyword it matches on is not important). I am having an issue where the do loop is going through all entries of the array. I am not sure if this is because my do loop is incorrect or because my index command is incorrect.
data JE.KeywordMatchTemp1;
set JE.JEMasterTemp;
if _n_ = 1 then do;
do i = 1 by 1 until (eof);
set JE.KeyWords end=eof;
array keywords[100] $30 _temporary_;
keywords[i] = Key_Words;
end;
end;
match = 0;
do i = 1 to 100 until(match=1);
if index(descr, keywords[i]) then match = 1;
end;
drop i;
run;
Add another condition to your DO loop to have it terminate when any match is found. You might want to also remember how many entries are in the array. Also make sure to use INDEX() function properly.
data JE.KeywordMatchTemp1;
if _n_ = 1 then do;
do i = 1 by 1 until (eof);
set JE.KeyWords end=eof;
array keywords[100] $30 _temporary_;
keywords[i] = Key_Words;
end;
last_i = i ;
retain last_i ;
end;
set JE.JEMasterTemp;
match = 0;
do i = 1 to last_i while (match=0) ;
if index(descr, trim(keywords[i]) ) then match = 1;
end;
drop i last_i;
run;
You have two problems; both of which would be easy to see in a small compact example (suggestion: put an example like this in your question in the future).
data partials;
input keyword $;
datalines;
home
auto
car
life
whole
renter
;;;;
run;
data master;
input #1 description $50.;
datalines;
Mutual Fund
State Farm Automobile Insurance
Checking Account
Life Insurance with Geico
Renter's Insurance
;;;;
run;
data want;
set master;
array keywords[100] $ _temporary_;
if _n_=1 then do;
do _i = 1 by 1 until (eof);
set partials end=eof;
keywords[_i] = keyword;
end;
end;
match=0;
do _m = 1 to dim(keywords) while (match=0 and keywords[_m] ne ' ');
if find(lowcase(description),lowcase(keywords[_m]),1,'t') then match=1;
end;
run;
Two things to look at here. First, notice the addition to the while. This guarantees we never try to match " " (which will always match if you have any spaces in your strings). The second is the t option in find (I note you have to add the 1 for start position, as for some reason the alternate version doesn't work at least for me) which trims spaces from both arguments. Otherwise it looks for "auto " instead of "auto".
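The two fixes described above (never test an empty keyword, and trim the padding before searching) can be sketched in Python as follows. This is only an illustration of the logic, not SAS: the function name `has_keyword` is an assumption, and unlike the SAS loop, which stops at the first blank array slot, this version simply skips blank entries.

```python
def has_keyword(description: str, keywords: list[str]) -> bool:
    """Return True as soon as any non-blank keyword occurs in the description."""
    text = description.lower()
    for keyword in keywords:
        # Trim padding, mirroring the 't' modifier of FIND in the SAS code.
        needle = keyword.strip().lower()
        if not needle:
            # A blank keyword would match almost any text, so never test it.
            continue
        if needle in text:
            return True  # stop at the first match, like while(match=0)
    return False


# Keywords padded with blanks, as they would be in a fixed-length SAS array.
keywords = ["home  ", "auto  ", "car   ", "life  ", "whole ", "renter", "      "]
for line in ["Mutual Fund", "State Farm Automobile Insurance", "Renter's Insurance"]:
    print(line, has_keyword(line, keywords))
```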

How to read a C generated binary file in Lua

I want to read a binary file of 32-bit integers produced by another program. The file contains only integers and no other characters (like spaces or commas). The C/C++ code to read this file is as follows:
FILE* pf = fopen("C:/rktemp/filename.dat", "r");
int sz = width*height;
int* vals = new int[sz];
int elread = fread((char*)vals, sizeof(int), sz, pf);
for( int j = 0; j < height; j++ )
{
for( int k = 0; k < width; k++ )
{
int i = j*width+k;
labels[i] = vals[i];
}
}
delete [] vals;
fclose(pf);
But I don't know how to read this file into array using Lua.
I've tried to read this file using io.read, but part of the array looks like this:
~~~~~~xxxxxxxxyyyyyyyyyyyyyyzzzzzzzz{{{{{{{{{|||||||||}}}}}}}}}}}~~~~~~~~~xxxxxxxyyyyyyyyyyyyyyzzzzzz{{{{{{{{{{|||||||||}}}}}}}}}}}~~~~~~~~~xxyyyyyyyyyyyyyzzzzz{{{{{{|||}}}yyyyyyyyyyyz{{{yyyyyyyyÞľūơǿȵɶʢ˺̤̼ͽаҩӱľǿجٴȵɶʢܷݸ˺໻⼼ӱľǿ
Also the Matlab code to read this file is like this:
row = image_size(1);
colomn = image_size(2);
fid = fopen(data_path,'r');
A = fread(fid, row * colomn, 'uint32')';
A = A + 1;
B = reshape(A,[colomn, row]);
B = B';
fclose(fid);
I've tried a function to convert bytes to integer, my code is like this:
function bytes_to_int(b1, b2, b3, b4)
if not b4 then error("need four bytes to convert to int",2) end
local n = b1 + b2*256 + b3*65536 + b4*16777216
n = (n > 2147483647) and (n - 4294967296) or n
return n
end
local sup_filename = '1.dat'
fid = io.open(sup_filename, "r")
st = bytes_to_int(fid:read("*all"):byte(1,4))
print(st)
fid:close()
But it still does not read the file properly.
You are only calling bytes_to_int once. You need to call it for every int you want to read. e.g.
fid = io.open(sup_filename, "rb")
while true do
local bytes = fid:read(4)
if bytes == nil then break end -- EOF
local st = bytes_to_int(bytes:byte(1,4))
print(st)
end
fid:close()
With Lua 5.3 or later you can call string.unpack, which takes a format string with many conversion options. The following options may be useful:
< sets little endian
> sets big endian
= sets native endian
i[n] a signed int with n bytes (default is native size)
I[n] an unsigned int with n bytes (default is native size)
The architecture of your PC is unknown, so I assume the data to read is unsigned and native-endian.
Since you are reading binary data from the file, you should use io.open(sup_filename, "rb").
The following code may be useful:
local fid = io.open(sup_filename, "rb")
local contents = fid:read("a")
local n
local now = 1
while now < #contents do
    -- No "local" here: assign to the outer variables so the position advances.
    n, now = string.unpack("=I4", contents, now)
    print(n)
end
fid:close()
see also: Lua 5.4 manual
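To make the byte layout concrete, here is the same decoding written as a small Python sketch (Python is used only for illustration; the Lua code above is the actual answer). It assumes the file holds unsigned 32-bit values in little-endian order, which is typical for x86 machines but is an assumption here; pass ">" for big-endian data. The function name `read_uint32_file` is hypothetical.

```python
import struct


def read_uint32_file(path: str, byte_order: str = "<") -> list[int]:
    """Read a file of raw unsigned 32-bit integers.

    byte_order: "<" little-endian (assumed default), ">" big-endian,
    "=" native, matching the prefixes used by string.unpack in Lua.
    """
    # Binary mode matters: text mode may translate or truncate bytes.
    with open(path, "rb") as f:
        data = f.read()
    # Ignore a trailing partial value, if any, so unpacking cannot fail.
    usable = len(data) - len(data) % 4
    return [value for (value,) in struct.iter_unpack(byte_order + "I", data[:usable])]
```

Each value is decoded from 4 consecutive bytes, which is the same arithmetic as `b1 + b2*256 + b3*65536 + b4*16777216` in the question's `bytes_to_int` for the little-endian case.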

Return all subsequences of a String

I'm trying to write pseudo-code and an algorithm in Matlab, to return all the subsequences of a string.
So the string X = {ABCD} will return XSubSequence = {A, B, C, D, AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD}; order does not matter, of course.
clear
x = 'ABC';
XSize = length(x);
count = 1;
i=1;
for i=1:XSize
ZSubSequence{count} = x(i);
count = count + 1;
for j=i+1:XSize
temp = strcat(x(i),x(j));
ZSubSequence{count} = temp;
count = count + 1;
for k=i+2:XSize
if j ~= k
temp = strcat(x(i), x(j), x(k));
ZSubSequence{count} = temp;
count = count + 1;
end
end
end
end
Is there any way to make this more dynamic, so I can add X of any size and it will be able to deal with it?
You might want to consider a completely different approach.
Each subsequence corresponds to the binary representation of a number from 1 to 2^length(x)-1. For your example, 1100=12 is AB, 0011=3 is CD, 1000=8 is A, 1111=2^4-1=15 is ABCD, and so on.
You might want to create this sequence and then translate it into the input output you have.
Example code:
x = 'ABCD';
XSize = length(x);
seq=dec2bin([1:2^XSize-1]);
Now all that is left is to translate it back to letters:
for i=1:1:2^XSize-1
for j=1:1:XSize
if seq(i,j)=='1'
seq(i,j)=x(j);
else
seq(i,j)='_';
end
end
end
Obviously the '_' should be removed and the output formatted the way you want them to be.
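As a compact illustration of the bitmask idea, here is a Python sketch (not Matlab) that builds each subsequence directly from the binary digits, so there are no placeholder characters to remove afterwards. The function name `subsequences` is an assumption made for this example.

```python
def subsequences(x: str) -> list[str]:
    """Return all non-empty subsequences of x, one per bitmask 1 .. 2^n - 1."""
    n = len(x)
    result = []
    for mask in range(1, 2 ** n):
        # Zero-padded binary string of width n, e.g. 12 -> "1100" for n = 4.
        bits = format(mask, f"0{n}b")
        # Keep the characters whose bit is set; the leftmost bit maps to x[0].
        result.append("".join(ch for ch, bit in zip(x, bits) if bit == "1"))
    return result


print(subsequences("ABC"))  # ['C', 'B', 'BC', 'A', 'AC', 'AB', 'ABC']
```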
This should do it. It only has one loop (no nesting), so it should be pretty fast.
x = 'ABCD';
n = length(x);
subseq = x.';
for ii = 2:n
subseq = strvcat(subseq, x(nchoosek(1:n,ii)));
end
subseq_deblanked = deblank(mat2cell(subseq, ones(size(subseq,1),1), n));
The results are:
subseq: char matrix where each row contains a subsequence padded with blank spaces.
subseq_deblanked: cell array of strings with the blank spaces removed, as you specified
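The same "choose k positions out of n" idea can be sketched in Python with `itertools.combinations`, which plays the role of `nchoosek` here. This is an illustration rather than a translation of the Matlab code, and the function name `subsequences_by_length` is an assumption.

```python
from itertools import combinations


def subsequences_by_length(x: str) -> list[str]:
    """Return all non-empty subsequences of x, shortest first."""
    # combinations(x, k) yields the k-character picks in their original order.
    return ["".join(c) for k in range(1, len(x) + 1) for c in combinations(x, k)]


print(subsequences_by_length("ABC"))  # ['A', 'B', 'C', 'AB', 'AC', 'BC', 'ABC']
```

Because the results are built as separate strings, no padding or deblanking step is needed.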

Finding minimum moves required for making 2 strings equal

This is a question from an online coding challenge (which has ended).
I just need some logic for how to approach it.
Problem Statement:
We have two strings A and B with the same super set of characters. We need to change these strings to obtain two equal strings. In each move we can perform one of the following operations:
1. swap two consecutive characters of a string
2. swap the first and the last characters of a string
A move can be performed on either string.
What is the minimum number of moves that we need in order to obtain two equal strings?
Input Format and Constraints:
The first and the second line of the input contain the two strings A and B. It is guaranteed that the supersets of their characters are equal.
1 <= length(A) = length(B) <= 2000
All the input characters are between 'a' and 'z'
Output Format:
Print the minimum number of moves to the only line of the output
Sample input:
aab
baa
Sample output:
1
Explanation:
Swap the first and last character of the string aab to convert it to baa. The two strings are now equal.
EDIT: Here is my first try, but I'm getting the wrong output. Can someone point out what is wrong with my approach?
int minStringMoves(char* a, char* b) {
int length, pos, i, j, moves=0;
char *ptr;
length = strlen(a);
for(i=0;i<length;i++) {
// Find the first occurrence of b[i] in a
ptr = strchr(a,b[i]);
pos = ptr - a;
// If its the last element, swap with the first
if(i==0 && pos == length-1) {
swap(&a[0], &a[length-1]);
moves++;
}
// Else swap from current index till pos
else {
for(j=pos;j>i;j--) {
swap(&a[j],&a[j-1]);
moves++;
}
}
// If equal, break
if(strcmp(a,b) == 0)
break;
}
return moves;
}
Take a look at this example:
aaaaaaaaab
abaaaaaaaa
Your solution: 8
aaaaaaaaab -> aaaaaaaaba -> aaaaaaabaa -> aaaaaabaaa -> aaaaabaaaa ->
aaaabaaaaa -> aaabaaaaaa -> aabaaaaaaa -> abaaaaaaaa
Proper solution: 2
aaaaaaaaab -> baaaaaaaaa -> abaaaaaaaa
You should check if swapping in the other direction would give you better result.
But sometimes you will also ruin the previous part of the string. eg:
caaaaaaaab
cbaaaaaaaa
caaaaaaaab -> baaaaaaaac -> abaaaaaaac
You need another swap here to put back the 'c' to the first place.
The proper algorithm is probably even more complex, but you can see now what's wrong in your solution.
The A* algorithm might work for this problem.
The initial node will be the original string.
The goal node will be the target string.
Each child of a node will be all possible transformations of that string.
The current cost g(x) is simply the number of transformations thus far.
The heuristic h(x) is half the number of characters in the wrong position.
Since h(x) is admissible (because a single transformation can't put more than 2 characters in their correct positions), the path to the target string will give the least number of transformations possible.
However, an elementary implementation will likely be too slow. Calculating all possible transformations of a string would be rather expensive.
Note that there's a lot of similarity between a node's siblings (its parent's children) and its children. So you may be able to just calculate all transformations of the original string and, from there, simply copy and recalculate data involving changed characters.
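The description above can be sketched in Python as follows. This is a minimal, unoptimized sketch under two assumptions: the names `neighbours` and `astar_min_moves` are invented for the example, and moves are applied to the first string only (each move undoes itself, so a move made on the second string can be mirrored as a move on the first without changing the total). The heuristic rounds half the mismatch count up, which stays admissible because move counts are whole numbers.

```python
import heapq


def neighbours(s: str):
    """Yield every string reachable from s in exactly one move."""
    n = len(s)
    chars = list(s)
    pairs = [(i, i + 1) for i in range(n - 1)]  # consecutive swaps
    if n > 2:
        pairs.append((0, n - 1))                # first/last swap
    for i, j in pairs:
        chars[i], chars[j] = chars[j], chars[i]
        yield "".join(chars)
        chars[i], chars[j] = chars[j], chars[i]  # undo before the next swap


def astar_min_moves(start: str, goal: str) -> int:
    """Minimum number of moves from start to goal, or -1 if unreachable."""
    def h(s: str) -> int:
        # One move fixes at most two positions, so this never overestimates.
        wrong = sum(a != b for a, b in zip(s, goal))
        return (wrong + 1) // 2

    best = {start: 0}                  # cheapest known g(x) per string
    heap = [(h(start), 0, start)]      # entries are (g + h, g, string)
    while heap:
        _, g, s = heapq.heappop(heap)
        if s == goal:
            return g
        if g > best[s]:
            continue                   # stale entry; a shorter path exists
        for t in neighbours(s):
            if g + 1 < best.get(t, float("inf")):
                best[t] = g + 1
                heapq.heappush(heap, (g + 1 + h(t), g + 1, t))
    return -1


print(astar_min_moves("aab", "baa"))  # 1
```

As the answer warns, the number of reachable strings can grow very quickly, so this is only suitable for short inputs.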
You can use dynamic programming (in effect, a memoized search over every reachable string). Go over all swap possibilities while storing each intermediate string along with the minimum number of steps that it took to get there. In other words, you calculate the minimum number of steps for every string that can be obtained by applying the given rules any number of times. Once that is done, you can look up the minimum number of steps needed to reach the target string. The number of reachable strings grows very quickly, so this is only practical for short inputs. Here's sample code in JavaScript, and its usage for the "aab" and "baa" example:
function swap(str, i, j) {
var s = str.split("");
s[i] = str[j];
s[j] = str[i];
return s.join("");
}
function calcMinimumSteps(current, stepsCount)
{
    // Stop if this string was already reached in the same number of steps or fewer.
    if (typeof(memory[current]) !== "undefined" && memory[current] <= stepsCount) {
        return;
    }
    // First visit, or a shorter path was found: record it and expand again
    // so that the improvement propagates to the strings reachable from here.
    memory[current] = stepsCount;
    calcMinimumSteps(swap(current, 0, current.length-1), stepsCount+1);
    for (var i = 0; i < current.length - 1; ++i) {
        calcMinimumSteps(swap(current, i, i + 1), stepsCount+1);
    }
}
var memory = {};
calcMinimumSteps("aab", 0);
alert("Minimum steps count: " + memory["baa"]);
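An alternative way to fill the same table of minimum step counts is a breadth-first search, which visits strings in order of increasing distance, so the first time a string is reached its step count is already minimal. The Python sketch below is an illustration under stated assumptions: the name `bfs_min_moves` is invented, and moves are applied to one string only (every move is its own inverse, so moves on the other string can be mirrored).

```python
from collections import deque


def bfs_min_moves(start: str, goal: str) -> int:
    """Minimum number of moves turning start into goal, or -1 if impossible."""
    if start == goal:
        return 0
    dist = {start: 0}          # plays the role of the "memory" table
    queue = deque([start])
    while queue:
        s = queue.popleft()
        n = len(s)
        pairs = [(i, i + 1) for i in range(n - 1)]  # consecutive swaps
        if n > 2:
            pairs.append((0, n - 1))                # first/last swap
        for i, j in pairs:
            chars = list(s)
            chars[i], chars[j] = chars[j], chars[i]
            t = "".join(chars)
            if t not in dist:
                dist[t] = dist[s] + 1
                if t == goal:
                    return dist[t]
                queue.append(t)
    return -1


print(bfs_min_moves("aab", "baa"))  # 1
```

Like the recursive version, this enumerates reachable strings explicitly, so it is a sketch for short inputs rather than a solution for lengths up to 2000.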
Here is the Ruby logic for this problem; copy this code into an .rb file and execute it.
str1 = "education" #Sample first string
str2 = "cnatdeiou" #Sample second string
moves_count = 0
no_swap = 0
count = str1.length - 1
def ends_swap(str1,str2)
str2 = swap_strings(str2,str2.length-1,0)
return str2
end
def swap_strings(str2,cp,np)
current_string = str2[cp]
new_string = str2[np]
str2[cp] = new_string
str2[np] = current_string
return str2
end
def consecutive_swap(str,current_position, target_position)
counter=0
diff = current_position > target_position ? -1 : 1
while current_position!=target_position
new_position = current_position + diff
str = swap_strings(str,current_position,new_position)
# p "-------"
# p "CP: #{current_position} NP: #{new_position} TP: #{target_position} String: #{str}"
current_position+=diff
counter+=1
end
return counter,str
end
while(str1 != str2 && count!=0)
counter = 1
if str1[-1]==str2[0]
# p "cross match"
str2 = ends_swap(str1,str2)
else
# p "No match for #{str2}-- Count: #{count}, TC: #{str1[count]}, CP: #{str2.index(str1[count])}"
str = str2[0..count]
cp = str.rindex(str1[count])
tp = count
counter, str2 = consecutive_swap(str2,cp,tp)
count-=1
end
moves_count+=counter
# p "Step: #{moves_count}"
# p str2
end
p "Total moves: #{moves_count}"
Please feel free to suggest any improvements in this code.
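Try this code. Hope this will help you. Note that it counts character deletions and insertions via the longest common subsequence, which is not the same measure as the swap moves in the problem: for the sample input aab / baa this formula gives 2, whereas the expected answer is 1.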
import java.util.Scanner;
public class TwoStringIdentical {
static int lcs(String str1, String str2, int m, int n) {
int L[][] = new int[m + 1][n + 1];
int i, j;
for (i = 0; i <= m; i++) {
for (j = 0; j <= n; j++) {
if (i == 0 || j == 0)
L[i][j] = 0;
else if (str1.charAt(i - 1) == str2.charAt(j - 1))
L[i][j] = L[i - 1][j - 1] + 1;
else
L[i][j] = Math.max(L[i - 1][j], L[i][j - 1]);
}
}
return L[m][n];
}
static void printMinTransformation(String str1, String str2) {
int m = str1.length();
int n = str2.length();
int len = lcs(str1, str2, m, n);
System.out.println((m - len)+(n - len));
}
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
String str1 = scan.nextLine();
String str2 = scan.nextLine();
// Use the strings that were read instead of hard-coded sample values.
printMinTransformation(str1, str2);
}
}
