GREP-like function to retrieve text in SAS


I want to retrieve specific text within a column in a SAS file.
The file would look like the following:
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
I want to retrieve just the information after the colon for Admission Code and Exit Code and put them in their own columns. The 'codes' can be any combination of letters, numbers, and blank spaces. The new data would look like the following:
Patient Location AdmissionCode ExitCode
001 B 123456 X 98765W
002 C 4567 WY 76543Z
003 D 67890 L 4321Z
I'm not familiar with the functions in SAS, but maybe the logic would look something like the following:
data want;
set have;
do i = 1 to dim(infoTxt)
AdmissionCode = substring(string1, regexpr(":", string) + 1);
ExitCode = substring(string2, regexpr(":", string) + 1);
run;
In the code above, string1 would represent the first line of text in infoTxt and string2 would represent the second line of text infoTxt.

SAS can utilize Perl regular expressions through the family of functions that start with PRX. The tip sheet is a great summary if you are familiar with regular expressions.
PRXMATCH and PRXPOSN can test a regex pattern with capture groups and retrieve the group text.
data have;
input;
text = _infile_;
datalines;
Admission Code: 123456 X Exit Code: 98765W
Admission Code: 4567 WY Exit Code: 76543Z
Admission Code: 67890 L Exit Code: 4321Z
run;
data want;
set have;
if _n_ = 1 then do;
retain rx;
rx = prxparse('/Admission Code: (.*?)\s*Exit Code:\s*(.*)/'); * \s* keeps stray blanks out of the captured codes;
end;
length AdmissionCode ExitCode $50;
if prxmatch(rx,text) then do;
AdmissionCode = prxposn(rx, 1, text);
ExitCode = prxposn(rx, 2, text);
end;
drop rx;
run;
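To check the capture-group idea outside SAS, here is a minimal Python sketch of the same pattern. The function name split_codes is hypothetical, and the lazy group plus the \s* pieces are assumptions added to keep surrounding blanks out of the captured codes; the sample rows are the ones from the data step above.

```python
import re

# Two capture groups, one per code, as in the PRXPARSE pattern.
# The lazy (.*?) and the \s* are assumptions added to trim surrounding blanks.
PATTERN = re.compile(r"Admission Code:\s*(.*?)\s*Exit Code:\s*(.*?)\s*$")


def split_codes(text):
    """Return (admission_code, exit_code), or None when the pattern does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    rows = [
        "Admission Code: 123456 X Exit Code: 98765W",
        "Admission Code: 4567 WY Exit Code: 76543Z",
        "Admission Code: 67890 L Exit Code: 4321Z",
    ]
    for row in rows:
        print(split_codes(row))
```

The groups play the role of prxposn(rx, 1, text) and prxposn(rx, 2, text); a row that does not match simply yields None instead of empty columns.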

I like a RegEX with a capture buffer as much as the next guy but you could also use input statement features to read this data.
data info;
infile cards n=2 firstobs=2;
input #1 patient :$3. location :$1. @'Admission Code: ' AdmissionCode &$16. #2 @'Exit Code: ' ExitCode &$16.;
cards;
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
;;;;
run;
proc print;
run;

There may be a solution out there that does it all in one data step. This one uses two steps to deal with the admission and exit codes being on different rows: first a data step, then a join to put them back together.
SAS does have regex syntax, but I used SAS character functions instead. substr takes three arguments: the string, the start position, and the length. The length is optional, and I've omitted it so that substr returns everything from the start position onward. retain is used to carry the patient and location down to the second row of each group.
data admission exit;
set grep;
retain patient2 location2;
if patient ne '' then do;
patient2=patient;
location2=location;
admissioncode=substr(infoTxt,find(infoTxt,":")+2);
output admission;
end;
else do;
exitcode=substr(infoTxt,find(infoTxt,":")+2);
output exit;
end;
run;
proc sql;
create table dat as select a.patient2 as patient,a.location2 as location,a.admissioncode,b.exitcode
from admission a
left join exit b on a.patient2=b.patient2 and a.location2=b.location2
;
quit;

Provided that you always have the same pattern of colons and line breaks, I think you can do this with scan:
admission_code = scan(infoTxt, 2, '3A0A0D'x);
exit_code = scan(infoTxt, 4, '3A0A0D'x);
This uses the hex literal '3A0A0D'x to specify :, line feed and carriage return as delimiters for the scan function.
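A rough Python stand-in for this use of scan, as a sketch only: it assumes infoTxt holds both lines in a single value with embedded line breaks, and the helper name scan is just borrowed for illustration. Like the SAS function with these delimiters, it treats runs of delimiters as one and leaves the blank after the colon in place, so the caller strips it.

```python
import re


def scan(text, n, delimiters):
    """Return the n-th (1-based) word of text split on any of the delimiter
    characters, or '' when there is no such word."""
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [word for word in re.split(pattern, text) if word]
    return words[n - 1] if 1 <= n <= len(words) else ""


if __name__ == "__main__":
    # Assumed layout: both lines stored in one value, separated by CR/LF.
    info_txt = "Admission Code: 123456 X\r\nExit Code: 98765W"
    delims = ":\n\r"  # same characters as the hex literal '3A0A0D'x
    print(scan(info_txt, 2, delims).strip())
    print(scan(info_txt, 4, delims).strip())
```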

Related

Find and replace acronyms by full province names in SAS

I need to replace a string in a SAS dataset in the following way :
OTTAWA ON should be replaced with OTTAWA ONTARIO
WHATEVER QC should be replaced with WHATEVER QUEBEC
etc.
However, HOUSE ON THE HILL should not become HOUSE ONTARIO THE HILL.
That is, I want to replace every instance of ON with ONTARIO, but only when ON is the last word in the string.

You could use regular expressions to do this. From what you have described, I think the following should work.
myString = prxchange("s/(.*)( ON)$/$1 ONTARIO/",-1,strip(myString));
myString = prxchange("s/(.*)( QC)$/$1 QUEBEC/",-1,strip(myString));
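The effect of the $ anchor can be checked with a short Python sketch of the same two substitutions. The names RULES and expand_suffix are hypothetical, and the patterns are simplified to the suffix alone rather than copied from the prxchange calls.

```python
import re

# (pattern, replacement) pairs mirroring the two substitutions above.
# The $ anchor means the abbreviation only matches as the last word.
RULES = [
    (r" ON$", " ONTARIO"),
    (r" QC$", " QUEBEC"),
]


def expand_suffix(text):
    """Expand a trailing province abbreviation; leave everything else alone."""
    text = text.strip()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    for sample in ["OTTAWA ON", "WHATEVER QC", "HOUSE ON THE HILL"]:
        print(expand_suffix(sample))
```

Because the match is anchored at the end of the stripped string, the ON in the middle of HOUSE ON THE HILL is never touched.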

Use a separate control data set to maintain the substitutions (postal code -> province) you want.
Load the control data into a hash
Process the data scanning out the last 'word'
If the word is a key in the hash then replace the word with the province value.
Presuming you are only performing transformations for a 'token' (CA postal code) as the final 'word' an example of the control data, data and transformation is as follows:
data O_Canada(label="Our home and native land");
length postal $2 province $26 ;
input postal& province&; * suffix & means data fields separated by >1 space;
datalines;
ON  Ontario
QC  Quebec
NS  Nova Scotia
NB  New Brunswick
MB  Manitoba
BC  British Columbia
PE  Prince Edward Island
SK  Saskatchewan
AB  Alberta
NL  Newfoundland and Labrador
;
data cities(label='Some popular places');
length place $100;
input place $CHAR50.;
datalines;
CALGARY AB
VANCOUVER BC
WINNIPEG MB
MONCTON NB
ST. JOHN'S NL
HALIFAX NS
TORONTO ON
MONTREAL QC
SASKATOON SK
CHARLOTTETOWN PE
WHITEHORSE YT
YELLOWKNIFE NT
IQALUIT NU
GOLDMINE YUKON
;
data cities;
modify cities;
if _n_ = 1 then do;
length postal $3 province $26; * postal 1 bigger so scanned postal will not always match;
declare hash provinces(dataset:'O_Canada');
provinces.defineKey('postal');
provinces.defineData('province');
provinces.defineDone();
call missing(postal, province);
drop postal province;
end;
postal = scan(place,-1,' ');
if provinces.find() eq 0 then do;
* this inline replacement presumes all postal codes are 2 characters;
* -1 from length will replace starting from found postal;
substr(place,length(place)-1) = province; * inline replacement;
replace;
end;
run;
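The lookup logic of the hash step (scan out the last word, replace it only if it is a key) can be sketched in Python with a plain dictionary. This is an illustration under assumptions: PROVINCES holds only a subset of the control table, and expand_last_word is a hypothetical name.

```python
# Subset of the control data above, keyed by postal abbreviation.
PROVINCES = {
    "ON": "Ontario",
    "QC": "Quebec",
    "NS": "Nova Scotia",
    "BC": "British Columbia",
}


def expand_last_word(place, lookup=PROVINCES):
    """Replace the final word with its province name when it is a known key."""
    trimmed = place.rstrip()
    head, sep, last = trimmed.rpartition(" ")
    if sep and last in lookup:
        return head + " " + lookup[last]
    return trimmed


if __name__ == "__main__":
    for place in ["TORONTO ON", "HALIFAX NS", "WHITEHORSE YT", "GOLDMINE YUKON"]:
        print(expand_last_word(place))
```

Places whose last word is not in the lookup (YT, or the full word YUKON) pass through unchanged, which is the behaviour the hash find() check gives in the data step.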

data GEOGRAPHY;
infile datalines truncover;
informat geo $2. graphy $32.;
input geo $ graphy $;
datalines;
ON ONTARIO
QC QUEBEC
;
proc sql;
select whatever_you_want,
case graphy
when '' then myString
else substr(myString, 1, length(myString) - length(geo)) || graphy
end as myString
from HAVE left join GEOGRAPHY on scan(myString, -1) eq geo;
quit;

scan(myString, -1) returns the last word in myString, and substr(myString, 1, length(myString) - 2) returns everything before a two-letter suffix, so in a data step this does the job (assuming myString is long enough to hold the expanded name):
cutString = substr(myString, 1, length(myString) - 2);
select (scan(myString, -1));
when ('ON') myString = catx(' ', cutString, 'ONTARIO');
when ('QC') myString = catx(' ', cutString, 'QUEBEC');
otherwise; * leave other values unchanged;
end;
or in SQL
select case scan(myString, -1)
when 'ON' then trim(myString) || 'TARIO'
when 'QC' then substr(myString, 1, length(myString) - 2) || 'QUEBEC'
else myString end as myString
from YOU_KNOW_BETTER_THAN_I_DO;

@Sonny, I think the regular expression is very good. And @astel, there is another way that is easy to understand:
data test;
InText = 'HOUSE ON THE HILL';
output;
InText = 'OTTAWA ON';
output;
run;
data _null_;
set test;
if cats(reverse(InText)) =: 'NO ' then OutText = tranwrd(InText,' ON',' ONTARIO');
put InText= @30 OutText=;
run;
The output will be
InText=HOUSE ON THE HILL OutText=
InText=OTTAWA ON OutText=OTTAWA ONTARIO
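Reverse the variable so you can easily check whether the reversed value starts with 'NO ', which means the original value ends with ' ON'. Then do the replacement with the tranwrd() function. Note that tranwrd() replaces every ' ON' in the string, so this relies on ' ON' appearing only at the end.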

Deleting duplicate answers containing special characters in the survey

Hello, I am dealing with the following problem. I have a survey in which respondents could mark several answers and also add their own. I am trying to get the unique answers so that I can count their occurrences. For example, suppose we have 3 answers: a, b, c. Person 1 marked answer a, person 2 marked answers b and c, and person 3 marked a and c. I would like to receive the result: "a" was marked 2 times. To do that, I'm trying to delete duplicate answers and create a macro variable that stores the unique answers: a, b, c.
I have already renamed all of the survey questions to v1-v&n_que., where n_que is a macro variable that holds the number of questions in the survey. I was trying to split all of the answers into a table (using the previous example, I would get a column with the following values): a, b, c, a, c. Then I wanted to sort the data and remove duplicates. I've tried the following:
%macro coll_ans(lib, tab);
%do _i_ = 1 %to &n_que. %by 1;
%global betav&_i_.;
proc sql noprint;
select distinct v&_i_. into :betav&_i_. separated by ', '
from &lib..&tab.
where v&_i_. ^= ' ';
quit;
data a&_i_.;
%do _j_ = 1 %to %sysfunc(countw(%quote(&&&betav&_i_.), ',')) %by 1;
text = %scan(%quote(&&&betav&_i_), &_j_., ',');
output;
%end;
run;
%end;
%mend coll_ans;
It's worth mentioning that if somebody picked more than one answer, for example a and b, the answers are separated with a comma. That's why I picked this separator, to unify the records.
I have tried almost everything: changing %quote to %bquote or %superq, and writing && instead of &&&. I keep getting the following error (the first of 40 others):
ERROR: The function NO is unknown, or cannot be accessed.
"NO" is one of the answers to the first question in the survey; the full answer is: NO (go to the 9th question). The whole survey is in Polish, but I am using the right encoding, so I don't believe that causes the problem (hopefully).
I will be grateful for any advice, because I have hit a wall.

Guessing you have a data set like:
data have ;
input id v1 : $8. v2 : $8.;
cards ;
1 a a
2 b,c b
3 a,c c
;
You can transpose the data set to make it have one record per ID-variable-value.
data tran (keep=id VarName Value);
set have ;
array vars{*} v1 v2 ;
do i=1 to dim(vars) ;
Varname=vname(vars{i}) ;
do j=1 to countw(vars{i},',') ;
Value=scan(vars{i},j,',') ;
output ;
end ;
end ;
run ;
The output data set looks like:
id Varname Value
1 v1 a
1 v2 a
2 v1 b
2 v1 c
2 v2 b
3 v1 a
3 v1 c
3 v2 c
You can then use PROC FREQ or SQL to get the counts.
proc freq data=tran ;
tables varname*value/missing list ;
run ;
Outputs
Varname Value Frequency
v1 a 2
v1 b 1
v1 c 2
v2 a 1
v2 b 1
v2 c 1
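The transpose-then-count idea can be mirrored in a few lines of Python for a quick sanity check of the expected frequencies. This is only a sketch: the rows below reproduce the guessed data set from this answer, and count_answers is a hypothetical helper.

```python
from collections import Counter

# Same guessed data as the "have" data set: id plus comma-separated answers.
ROWS = [
    (1, {"v1": "a", "v2": "a"}),
    (2, {"v1": "b,c", "v2": "b"}),
    (3, {"v1": "a,c", "v2": "c"}),
]


def count_answers(rows):
    """Split each comma-separated answer and count (variable, value) pairs."""
    counts = Counter()
    for _id, answers in rows:
        for var_name, value in answers.items():
            for token in value.split(","):
                token = token.strip()
                if token:
                    counts[(var_name, token)] += 1
    return counts


if __name__ == "__main__":
    for (var_name, value), freq in sorted(count_answers(ROWS).items()):
        print(var_name, value, freq)
```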

First of all, it would be better if you posted the format in which you receive the survey data, as this would dictate the simplest and fastest approach overall.
Also, as a general rule, it's best to get the inputs and outputs right in non-macro SAS code and then use macros to optimize the process. It's easier to debug that way, even for someone who's been using macros for a long time... :)
That said, from your Proc SQL code, it appears that:
a. you're receiving the answers in a single delimited text field, such as "a,b,c" or "b,c" or "a,b,z"
*** example data;
data work.answers;
length answer $10.;
input answer;
datalines;
a,b,c
a
b
b,c
NO
a,b,z
n
run;
*** example valid answer entries;
data work.valid;
length valid $10.;
input valid;
datalines;
a
b
c
NO
YES
run;
b. you want to validate each answer entry and generate counts, like:
NO 1
YES 0
a 3
b 4
c 2
Many ways to do this in SAS but for parsing tokenized text data, a de-duplicated lookup table using the hash object is handy. Code below also prints the following to the log for debugging/verification...
answer=a,b,c num_answers=3 val=a val=b val=c validated=a,b,c
answer=a num_answers=1 val=a validated=a
answer=b num_answers=1 val=b validated=b
answer=b,c num_answers=2 val=b val=c validated=b,c
answer=NO num_answers=1 val=NO validated=NO
answer=a,b,z num_answers=3 val=a val=b val=z -invalid validated=a,b, validated=a,b,
answer=n num_answers=1 val=n -invalid validated= validated=
Once you've mastered the declaration syntax for the hash object, it's quite logical and relatively fast. And of course you can add validation rules - such as upper case & lower case entries ...
*** first, de-duplicate your lookup table. ;
proc sort data=work.valid nodupkey;
by valid;
run;
data _null_;
length valid $10. answer_count 4. count 4. validated $10.;
retain count 0;
*** initialize & load hash object ;
if _N_ = 1 then do;
declare hash h(multidata: 'n', ordered: 'y');
rc = h.defineKey('valid');
rc = h.defineData('valid','count');
rc = h.defineDone();
do until(eof1);
set work.valid end=eof1;
h.add();
end;
end;
*** now process questions/answers;
do until(eof);
*** read each answer;
set answers end=eof;
num_answers=countw(answer);
putlog answer= num_answers= @;
*** parse each answer entry;
validated=answer;
do i=1 to num_answers;
val=scan(answer,i);
putlog val= @;
*** (optional) keep track of total #answers: valid + invalid;
answer_count+1;
*** check answer entry in lookup table;
rc= h.find(key:val);
*** if entry NOT in lookup table, remove from validated answer;
if rc ne 0 then do;
putlog "-invalid " @;
validated=tranwrd(validated,trim(val),' ');
end;
*** if answer found, increment counter in lookup table;
else do;
count+1;
h.replace();
end;
end;
putlog validated=;
end;
*** save table of answer counts to disk;
if eof then h.output(dataset: 'work.counts');
run;

multiple numbers at different locations in a string variable

I have 7-digit numbers in a string variable, but at a different position in each record. Below is an example of my data:
ser string
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
.....
.....
My requirement is to create a new variable with just the number from the string variable. The output should look like this:
ser number
101 1013456
102 2017685
103 1897654
I have the list of numbers, which can be created as a macro variable:
%let num=1013456,2017685,1897654;
I have used the scan and substr functions but didn't get the desired result.
I would appreciate any solution for this. Thanks.

Try using the compress() function to remove the unwanted characters and the input() function to convert to a number.
data want;
set have;
number = input(compress(string,':','as'),7.);
drop string;
run;
The second argument to compress explicitly removes the : character. The as modifier removes alphabetic characters (a) and space characters (s).

You can use a regular expression to extract the numbers. Check out the PRX functions in SAS.
Here's an example of how to accomplish your goal using a regular expression:
data inData;
length ser 8 string $100;
ser = 101;
string = 'purchase items id: 1013456';
output;
ser = 102;
string = 'entry no: 2017685';
output;
ser = 103;
string = 'id: 1897654 item';
output;
run;
data outData;
length ser 8 number $7;
retain re;
set inData;
if _n_ = 1 then do;
re = prxparse("/.*(\d{7}).*/");
end;
if prxmatch(re, string) then do;
number = prxposn(re, 1, string);
end;
keep ser number;
run;

A slightly simpler regular expression approach using only prxmatch:
data have;
input ser string $50.;
cards;
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
;
run;
data want;
set have;
num = input(substr(string,prxmatch('/\d{7}/',string),7),8.);
run;
This will only match the first 7-digit number in string. By contrast, if it contains any other numbers then the compress approach will concatenate all of them.
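The difference between the two approaches shows up as soon as a string contains a second number. The Python sketch below illustrates it; both function names are hypothetical, and strip_non_digits only approximates what compress(string, ':', 'as') leaves behind (it removes ASCII letters, whitespace and colons).

```python
import re


def first_seven_digits(text):
    """Return the first run of seven digits as an int, or None if there is none."""
    match = re.search(r"\d{7}", text)
    return int(match.group()) if match else None


def strip_non_digits(text):
    """Approximation of compress(string, ':', 'as'): drop letters, whitespace
    and colons, keeping whatever is left."""
    return re.sub(r"[A-Za-z\s:]", "", text)


if __name__ == "__main__":
    sample = "order 12 id: 1013456"
    print(first_seven_digits(sample))
    print(strip_non_digits(sample))
```

With the extra 12 in the sample, the regex still isolates 1013456, while the stripping approach runs the two numbers together.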

character position in string

I have a data frame with character strings in column1 and ID in column2. The string contains A,T,G or C.
I would like to print the lines that have an A at position 1.
Then I would like to print the lines that have A at position 2 and so on and save them in separate files.
So far I have used biostrings in R for similar analysis, but it won't work for this problem exactly. I would like to use perl.
Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p
......
600 more lines

Biostrings will work perfectly, and will be pretty fast. Let's call your DNA stringset mydata
HasA <- sapply(mydata,function(x) as.character(x[2]) == "A")
Now you have a vector of TRUE or FALSE indicating which sequence has an A at position 2. You can make that into a nice data frame like this
HasA.df <- data.frame("SeqName" = names(mydata), "A_at_2" = HasA)

Not sure about the expected result,
mydata <- read.table(text="Sequence ID
TATACAAGGGCAAGCTCTCTGT mmu-miR-381-3p
TCGGATCCGTCTGAGCT mmu-miR-127-3p
ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p",sep="",header=T,stringsAsFactors=F)
mCh <- max(nchar(mydata[,1])) #gives the maximum number of characters in the first column
sapply(seq(mCh), function(i) substr(mydata[,1],i,i)=="A") #gives the index
You can use which to get the index of the row that satisfies the condition for each position
res <- stack(setNames(sapply(seq(mCh),
function(i) which(substr(mydata[,1],i,i)=="A")),1:mCh))[,2:1]
tail(res, 5) #for the 13th position, 1st and 3rd row of the sequence are TRUE
ind values
#11 13 1
#12 13 3
#13 14 2
#14 15 3
#15 20 3
use the index values to extract the rows. For the 1st position
mydata[res$values[res$ind==1],]
# Sequence ID
# 3 ATAGTAGACCGTATAGCGTACG mmu-miR-411-5p

Using a perl one-liner
perl -Mautodie -lane '
BEGIN {($f) = @ARGV}
next if $. == 1;
my @c = split //, $F[0];
for my $i (grep {$c[$_] eq "A"} (0..$#c)) {
open my $fh, ">>", "$f.$i";
print $fh $_;
}
' file
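For comparison, the same grouping can be sketched in Python. This is an illustration, not a port of the one-liner: rows_by_position is a hypothetical helper, positions are 1-based here (the Perl indices are 0-based), and writing each group to its own file is left out.

```python
from collections import defaultdict

# The three sample rows from the question.
ROWS = [
    ("TATACAAGGGCAAGCTCTCTGT", "mmu-miR-381-3p"),
    ("TCGGATCCGTCTGAGCT", "mmu-miR-127-3p"),
    ("ATAGTAGACCGTATAGCGTACG", "mmu-miR-411-5p"),
]


def rows_by_position(rows, base="A"):
    """Map each 1-based position to the rows that have `base` at that position."""
    groups = defaultdict(list)
    for sequence, name in rows:
        for position, char in enumerate(sequence, start=1):
            if char == base:
                groups[position].append((sequence, name))
    return groups


if __name__ == "__main__":
    groups = rows_by_position(ROWS)
    for position in sorted(groups):
        print(position, [name for _seq, name in groups[position]])
```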

How to remove any trailing numbers from a string?

Sample inputs:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number 1"
"Blah blah 123123"
"More blah 12321 123123 123132"
Expected output:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number"
"Blah blah"
"More blah"
I'm thinking it's a 2 step process:
Split the entire string into characters, one row per character (including spaces), in reverse order
Loop through, and for each one if it's a space or a number, skip, otherwise add to the start of another array.
And i should end up with the desired result.
I can think of a few quick and dirty ways, but this needs to perform fairly well, as it's a trigger that runs on a busy table, so thought i'd throw it out to the T-SQL pros.
Any suggestions?

This solution should be a bit more efficient because it first checks to see if the string contains a number, then it checks to see if the string ends in a number.
CREATE FUNCTION dbo.trim_ending_numbers(@columnvalue AS VARCHAR(100)) RETURNS VARCHAR(100)
BEGIN
--This will make the query more efficient by first checking to see if it contains any numbers at all
IF @columnvalue NOT LIKE '%[0-9]%'
RETURN @columnvalue
DECLARE @counter INT
SET @counter = LEN(@columnvalue)
IF ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 0
RETURN @columnvalue
WHILE ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 1 OR SUBSTRING(@columnvalue,@counter,1) = ' '
BEGIN
SET @counter = @counter -1
IF @counter < 0
BREAK
END
SET @columnvalue = SUBSTRING(@columnvalue,0,@counter+1)
RETURN @columnvalue
END
If you run
SELECT dbo.trim_ending_numbers('More blah 12321 123123 123132')
It will return
'More blah'

A loop on a busy table is very unlikely to perform adequately. Use REVERSE and PATINDEX to find the first character that is not a digit, begin a SUBSTRING there, then REVERSE the result. This will be plenty fast, with no loops.
Your examples imply that you also don't want to match spaces.
DECLARE @t TABLE (s NVARCHAR(500))
INSERT INTO @t (s)
VALUES
('Hi there how are you'),('What is the #1 pizza place in NYC?'),('Dominoes is number 1'),('Blah blah 123123'),('More blah 12321 123123 123132')
select s
, reverse(s) as beginning
, patindex('%[^0-9 ]%',reverse(s)) as progress
, substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s))) as [more progress]
, reverse(substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s)))) as SOLUTION
from @t
Final answer:
reverse( substring( reverse( @s ), patindex( '%[^0-9 ]%', reverse( @s ) ), 1 + len( @s ) - patindex( '%[^0-9 ]%', reverse( @s ) ) ) )
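The reverse-and-search logic can be traced in a short Python sketch using the sample inputs from the question. The function name is hypothetical, and returning an empty string for an all-digit input is a choice made in this sketch, not something the T-SQL expression above is claimed to do.

```python
def trim_trailing_numbers(text):
    """Drop trailing digits and spaces by scanning the reversed string for the
    first character that is neither (the PATINDEX('%[^0-9 ]%', ...) step)."""
    reversed_text = text[::-1]
    for index, char in enumerate(reversed_text):
        if char not in "0123456789 ":
            # Keep everything from the first "real" character on, then undo the reverse.
            return reversed_text[index:][::-1]
    return ""  # assumption: nothing but digits and spaces leaves an empty result


if __name__ == "__main__":
    samples = [
        "Hi there how are you",
        "What is the #1 pizza place in NYC?",
        "Dominoes is number 1",
        "Blah blah 123123",
        "More blah 12321 123123 123132",
    ]
    for sample in samples:
        print(repr(trim_trailing_numbers(sample)))
```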

I believe that the below query is fast and useful
select reverse(substring(reverse(colA),PATINDEX('%[0-9][a-z]%',reverse(colA))+1,
len(colA)-PATINDEX('%[0-9][a-z]%',reverse(colA))))
from TBLA

--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat 2 sat33 on4 the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat1'
--DECLARE @String VARCHAR(100) = '2121'
DECLARE @String VARCHAR(100) = 'the fat cat 2 2 2 2 sat on the mat2121'
DECLARE @Answer NVARCHAR(MAX),
@Index INTEGER = LEN(@String),
@Character CHAR,
@IncorrectCharacterIndex SMALLINT
-- Start from the end, going to the front.
WHILE @Index > 0 BEGIN
-- Get each character, starting from the end
SET @Character = SUBSTRING(@String, @Index, 1)
-- Regex check.
SET @IncorrectCharacterIndex = PATINDEX('%[A-Za-z-]%', @Character)
-- Is there a match? We're lucky here because it will either match on index 1 or not (index 0)
IF (@IncorrectCharacterIndex != 0)
BEGIN
-- We have a legit character.
SET @Answer = SUBSTRING(@String, 0, @Index + 1)
SET @Index = 0
END
ELSE
SET @Index = @Index - 1 -- No match, lets go back one index slot.
END
PRINT LTRIM(RTRIM(@Answer))
NOTE: I've included a dash in the valid regex match.

Thanks for all the contributions which were very helpful. To go further and extract off JUST the trailing number:
, substring(s, 2 + len(s) - patindex('%[^0-9 ]%',reverse(s)), 99) as numeric_suffix
I needed to sort on the number suffix so had to restrict the pattern to numerics and to get around numbers of different lengths sorting as text (ie I wanted 2 to sort before 19) cast the result:
,cast(substring(s, 2 + len(s) - patindex('%[^0-9]%',reverse(s)),99) as integer) as numeric_suffix
