Find and replace acronyms by full province names in SAS - string

I need to replace a string in a SAS dataset in the following way :
OTTAWA ON should be replaced with OTTAWA ONTARIO
WHATEVER QC should be replaced with WHATEVER QUEBEC
etc.
However, HOUSE ON THE HILL should not become HOUSE ONTARIO THE HILL.
That is, I want to replace all instances of ON with ONTARIO but only if ON exists as the last word in the string

You could use regular expressions to do this. From what you have described, I think the following should work.
myString = prxchange("s/(.*)( ON)$/$1 ONTARIO/",-1,strip(myString));
myString = prxchange("s/(.*)( QC)$/$1 QUEBEC/",-1,strip(myString));
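To check the end-of-string anchoring outside SAS, here is a minimal Python sketch of the same idea. The `PROVINCES` mapping and the function name `expand_trailing_code` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical mapping for illustration; extend with other codes as needed.
PROVINCES = {"ON": "ONTARIO", "QC": "QUEBEC"}

def expand_trailing_code(text: str) -> str:
    """Replace a province code only when it is the last word of the string."""
    def swap(match: re.Match) -> str:
        return " " + PROVINCES[match.group(1)]

    # " (ON|QC)$" : a blank, one of the codes, then the end of the string.
    pattern = r" (" + "|".join(PROVINCES) + r")$"
    return re.sub(pattern, swap, text.strip())

print(expand_trailing_code("OTTAWA ON"))          # OTTAWA ONTARIO
print(expand_trailing_code("HOUSE ON THE HILL"))  # HOUSE ON THE HILL
```

The `$` anchor is what keeps the ON in the middle of HOUSE ON THE HILL untouched.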

Use a separate control data set to maintain the substitutions (postal code -> province) you want.
Load the control data into a hash
Process the data scanning out the last 'word'
If the word is a key in the hash then replace the word with the province value.
Presuming you are only performing transformations for a 'token' (CA postal code) as the final 'word' an example of the control data, data and transformation is as follows:
data O_Canada(label="Our home and native land");
length postal $2 province $26 ;
input postal province &; * trailing & lets province contain single embedded blanks;
datalines;
ON Ontario
QC Quebec
NS Nova Scotia
NB New Brunswick
MB Manitoba
BC British Columbia
PE Prince Edward Island
SK Saskatchewan
AB Alberta
NL Newfoundland and Labrador
;
data cities(label='Some popular places');
length place $100;
input place $CHAR50.;
datalines;
CALGARY AB
VANCOUVER BC
WINNIPEG MB
MONCTON NB
ST. JOHN'S NL
HALIFAX NS
TORONTO ON
MONTREAL QC
SASKATOON SK
CHARLOTTETOWN PE
WHITEHORSE YT
YELLOWKNIFE NT
IQALUIT NU
GOLDMINE YUKON
;
data cities;
modify cities;
if _n_ = 1 then do;
length postal $3 province $26; * postal 1 bigger so scanned postal will not always match;
declare hash provinces(dataset:'O_Canada');
provinces.defineKey('postal');
provinces.defineData('province');
provinces.defineDone();
call missing(postal, province);
drop postal province;
end;
postal = scan(place,-1,' ');
if provinces.find() eq 0 then do;
* this inline replacement presumes all postal codes are 2 characters;
* -1 from length will replace starting from found postal;
substr(place,length(place)-1) = province; * inline replacement;
replace;
end;
run;
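The lookup logic above (take the last word, check it against a control table, replace it on a hit) can be sketched in Python as follows. This is only an illustration of the steps, not a translation of the SAS hash object; the dictionary holds a few rows of the control data and `expand_last_word` is a hypothetical name.

```python
# Control table, mirroring a few rows of the O_Canada data set.
PROVINCES = {
    "ON": "Ontario",
    "QC": "Quebec",
    "NS": "Nova Scotia",
    "NL": "Newfoundland and Labrador",
}

def expand_last_word(place: str) -> str:
    """Replace the final word with the province name if it is a known code."""
    head, sep, last = place.rstrip().rpartition(" ")
    province = PROVINCES.get(last)
    if sep and province is not None:
        return f"{head} {province}"
    return place  # no blank in the value, or last word is not a known code

print(expand_last_word("TORONTO ON"))      # TORONTO Ontario
print(expand_last_word("WHITEHORSE YT"))   # WHITEHORSE YT (YT is not in the table)
print(expand_last_word("GOLDMINE YUKON"))  # GOLDMINE YUKON
```

Keeping the substitutions in a table means new codes can be added without touching the transformation code.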

data GEOGRAPHY;
infile datalines truncover;
informat geo $2. graphy $32.;
input geo $ graphy $;
datalines;
ON ONTARIO
QC QUEBEC
;
proc sql;
select whatever_you_want,
case graphy
when '' then myString
else substr(myString, 1, length(myString) - length(geo)) || graphy
end as myString
from HAVE left join GEOGRAPHY on scan(myString, -1) eq geo;
quit;

scan(myString, -1) returns the last word in myString and trim(myString) removes trailing blanks, so in a data step this does the job:
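cutString = substr(myString, 1, length(myString) - 2); * everything before the 2-character code;
select (scan(myString, -1));
when ('ON') myString = trim(cutString) || ' ONTARIO';
when ('QC') myString = trim(cutString) || ' QUEBEC';
otherwise; * leave other values unchanged;
end;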
or in SQL
select case scan(myString, -1)
when 'ON' then trim(myString) || 'TARIO'
when 'QC' then substr(myString, 1, length(myString) - 2) || 'QUEBEC'
else myString end as myString
from YOU_KNOW_BETTER_THAN_I_DO;

@Sonny, I think the regular expression is very good. And @astel, here is another way that is easy to understand:
data test;
InText = 'HOUSE ON THE HILL';
output;
InText = 'OTTAWA ON';
output;
run;
data _null_;
set test;
if cats(reverse(InText)) =: 'NO ' then OutText = tranwrd(InText,' ON',' ONTARIO');
put Intext = @30 OutText = ;
run;
The output will be
InText=HOUSE ON THE HILL OutText=
InText=OTTAWA ON OutText=OTTAWA ONTARIO
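Reverse the variable so you can easily check whether the reversed value starts with NO followed by a blank, which means the original value ends with ON as a separate word. Then do the replacement with the tranwrd() function. Note that tranwrd() replaces every occurrence of ' ON', so a value such as RIGHT ON TIME ON would have both occurrences changed.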

Related

How to separate a string by Capital Letter?

I currently have to write code in ABAP that handles a string containing multiple words, each starting with a capital (uppercase) letter, with no spaces in between.
I have to separate it into an internal table like this:
INPUT :
NameAgeAddress
OUTPUT :
Name
Age
Address
Here is the shortest code I could find, which uses a regular expression combined with SPLIT:
SPLIT replace( val = 'NameAgeAddress' regex = `(?!^.)\u` with = ` $0` occ = 0 )
AT ` `
INTO TABLE itab.
So, replace converts 'NameAgeAddress' into 'Name Age Address' and SPLIT puts the 3 words into an internal table.
Details:
(?!^.) to say the next character to find (\u) should not be the first character
\u being any upper case letter
$0 to replace the found string ($0) by itself preceded with a space character
occ = 0 to replace all occurrences
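The same two-step idea (insert a space before every capital except the first character, then split on spaces) can be tried out in Python. This is a sketch for comparison only; Python's `re` syntax differs from ABAP's, so the pattern below uses a lookbehind and a lookahead rather than `\u`, and `split_on_capitals` is a hypothetical name.

```python
import re

def split_on_capitals(text: str) -> list[str]:
    """Insert a space before each capital that is not at position 0, then split."""
    # (?<!^)    : not at the very start of the string
    # (?=[A-Z]) : the next character is an upper-case ASCII letter
    spaced = re.sub(r"(?<!^)(?=[A-Z])", " ", text)
    return spaced.split(" ")

print(split_on_capitals("NameAgeAddress"))  # ['Name', 'Age', 'Address']
```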
Unfortunately, the SPLIT statement in ABAP does not allow a regex as separator expression. Therefore, we have to use progressive matching, which is a bit awkward in ABAP:
report zz_test_split_capital.
parameters: p_input type string default 'NameAgeAddress' lower case.
data: output type stringtab,
off type i,
moff type i,
mlen type i.
while off < strlen( p_input ).
find regex '[A-Z][^A-Z]*'
in section offset off of p_input
match offset moff match length mlen.
if sy-subrc eq 0.
append substring( val = p_input off = moff len = mlen ) to output.
off = moff + mlen.
else.
exit.
endif.
endwhile.
cl_demo_output=>display_data( output ).
Just for comparison, the following statement would do the job in Perl:
my $input = "NameAgeAddress";
my @output = split /(?=[A-Z])/, $input;
# gives @output = ('Name','Age','Address')
This is easy with regular expressions. The solution could look like this.
REPORT ZZZ.
DATA: g_string TYPE string VALUE `NameAgeAddress`.
DATA(gcl_regex) = NEW cl_abap_regex( pattern = `[A-Z]{1}[a-z]+` ).
DATA(gcl_matcher) = gcl_regex->create_matcher( text = g_string ).
WHILE gcl_matcher->find_next( ).
DATA(g_match_result) = gcl_matcher->get_match( ).
WRITE / g_string+g_match_result-offset(g_match_result-length).
ENDWHILE.
For when regular expressions are just overkill and plain old ABAP will do:
DATA(str) = 'NameAgeAddress'.
IF str CA sy-abcde.
DATA(off) = 0.
DO.
data(tailstart) = off + 1.
IF str+tailstart CA sy-abcde.
DATA(len) = sy-fdpos + 1.
WRITE: / str+off(len).
add len to off.
ELSE.
EXIT.
ENDIF.
ENDDO.
write / str+off.
ENDIF.
If you do not want to use Regex, or cannot use it, here is another solution:
DATA: lf_input TYPE string VALUE 'NameAgeAddress',
lf_offset TYPE i,
lf_current_letter TYPE char1,
lf_letter_in_capital TYPE char1,
lf_word TYPE string,
lt_word LIKE TABLE OF lf_word.
DO strlen( lf_input ) TIMES.
lf_offset = sy-index - 1.
lf_current_letter = lf_input+lf_offset(1).
lf_letter_in_capital = to_upper( lf_current_letter ).
IF lf_current_letter = lf_letter_in_capital.
APPEND INITIAL LINE TO lt_word ASSIGNING FIELD-SYMBOL(<ls_word>).
ENDIF.
IF <ls_word> IS ASSIGNED. "if input string does not start with capital letter
<ls_word> = <ls_word> && lf_current_letter.
ENDIF.
ENDDO.

GREP-like function to retrieve text in SAS

I want to retrieve specific text within a column in a SAS file.
The file would look like the following:
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
I want to retrieve just the information after the colon for Admission Code and Exit Code and put them in their own columns. The 'codes' can be any combination of letters, numbers, and blank spaces. The new data would look like the following:
Patient Location AdmissionCode ExitCode
001 B 123456 X 98765W
002 C 4567 WY 76543Z
003 D 67890 L 4321Z
I'm not familiar with the functions in SAS, but maybe the logic would look something like the following:
data want;
set have;
do i = 1 to dim(infoTxt)
AdmissionCode = substring(string1, regexpr(":", string) + 1);
ExitCode = substring(string2, regexpr(":", string) + 1);
run;
In the code above, string1 would represent the first line of text in infoTxt and string2 would represent the second line of text infoTxt.
SAS can utilize Perl regular expressions through the family of functions that start with PRX. The tip sheet is a great summary if you are familiar with regular expressions.
PRXMATCH and PRXPOSN can test a regex pattern with capture groups and retrieve the group text.
data have;
input;
text = _infile_;
datalines;
Admission Code: 123456 X Exit Code: 98765W
Admission Code: 4567 WY Exit Code: 76543Z
Admission Code: 67890 L Exit Code: 4321Z
run;
data want;
set have;
if _n_ = 1 then do;
retain rx;
rx = prxparse ('/Admission Code: (.*)Exit Code:(.*)/');
end;
length AdmissionCode ExitCode $50;
if prxmatch(rx,text) then do;
AdmissionCode = prxposn(rx, 1, text);
ExitCode = prxposn(rx, 2, text);
end;
drop rx;
run;
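For readers who want to experiment with the capture-group idea before writing the PRX calls, here is a small Python sketch. It is not the SAS pattern verbatim: this variant also trims the blanks around each captured value, and `extract_codes` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# Two capture groups: the text after "Admission Code:" and after "Exit Code:".
PATTERN = re.compile(r"Admission Code:\s*(.*?)\s*Exit Code:\s*(.*?)\s*$")

def extract_codes(text: str) -> Optional[Tuple[str, str]]:
    """Return (admission_code, exit_code), or None if the line does not match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_codes("Admission Code: 123456 X Exit Code: 98765W"))
# ('123456 X', '98765W')
```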
I like a RegEX with a capture buffer as much as the next guy but you could also use input statement features to read this data.
data info;
infile cards n=2 firstobs=2;
input #1 patient:$3. location :$1. @'Admission Code: ' AdmissionCode &$16. #2 @'Exit Code: ' ExitCode &$16.;
cards;
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
;;;;
run;
proc print;
run;
There may be a solution out there that does it all in one data step. This creates two steps to deal with the admission and exit being on different rows-- first a data step, then a join to get it back together.
SAS does have regex syntax but I used SAS character functions instead. substr takes three arguments: string, start position, and length. The length is optional, and I've omitted it so that it grabs everything from the start position onward. retain is used to fill in the patient and location in the second row of each group.
data admission exit;
set grep;
retain patient2 location2;
if patient ne '' then do;
patient2=patient;
location2=location;
admissioncode=substr(infoTxt,find(infoTxt,":")+2);
output admission;
end;
else do;
exitcode=substr(infoTxt,find(infoTxt,":")+2);
output exit;
end;
run;
proc sql;
create table dat as select a.patient2 as patient,a.location2 as location,a.admissioncode,b.exitcode
from admission a
left join exit b on a.patient2=b.patient2 and a.location2=b.location2
;
quit;
Provided that you always have the same pattern of colons and line breaks, I think you can do this with scan:
admission_code = scan(infoTxt, 2, '3A0A0D'x);
exit_code = scan(infoTxt, 4, '3A0A0D'x);
This uses the hex literal '3A0A0D'x to specify :, line feed and carriage return as delimiters for the scan function.

multiple numbers at different locations in a string variable

I have 7-digit numbers in a string variable, but at a different position in each record. Below is an example of my data:
ser string
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
.....
.....
My requirement is to create a new variable with just the numeric from string variable. Output should look like this
ser number
101 1013456
102 2017685
103 1897654
I have the list of the numbers which can be created as a macro variable
%let num=1013456,2017685,1897654
I have used scan and substr functions but didn't get the desired result
I would appreciate any solution for this. Thanks
Try using the compress() function to remove the unwanted characters and the input() function to convert to a number.
data want;
set have;
number = input(compress(string,':','as'),7.);
drop string;
run;
The second argument to compress explicitly removes the : character. The as modifier removes alphabetic characters (a) and space characters (s).
You can use a regular expression to extract the numbers. Check out the PRX functions in SAS.
Here's an example of how to accomplish your goal using a regular expression:
data inData;
length ser 8 string $100;
ser = 101;
string = 'purchase items id: 1013456';
output;
ser = 102;
string = 'entry no: 2017685';
output;
ser = 103;
string = 'id: 1897654 item';
output;
run;
data outData;
length ser 8 number $7;
retain re;
set inData;
if _n_ = 1 then do;
re = prxparse("/.*(\d{7}).*/");
end;
if prxmatch(re, string) then do;
number = prxposn(re, 1, string);
end;
keep ser number;
run;
A slightly simpler regular expression approach using only prxmatch:
data have;
input ser string $50.;
cards;
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
;
run;
data want;
set have;
num = input(substr(string,prxmatch('/\d{7}/',string),7),8.);
run;
This will only match the first 7-digit number in string. By contrast, if it contains any other numbers then the compress approach will concatenate all of them.
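The difference between the two approaches shows up as soon as a record contains a second number. The Python sketch below is only an illustration of that difference: `all_digits_joined` is a simplified stand-in that keeps digits only, whereas the SAS compress call removes colons, letters and blanks and the 7. informat then reads the first seven characters. Both function names and the sample string are hypothetical.

```python
import re
from typing import Optional

def first_seven_digits(text: str) -> Optional[int]:
    """Regex approach: take the first run of exactly seven consecutive digits."""
    match = re.search(r"\d{7}", text)
    return int(match.group()) if match else None

def all_digits_joined(text: str) -> Optional[int]:
    """Strip-everything-else approach: every digit in the string is concatenated."""
    digits = "".join(ch for ch in text if "0" <= ch <= "9")
    return int(digits) if digits else None

sample = "order 12 id: 1013456"
print(first_seven_digits(sample))  # 1013456
print(all_digits_joined(sample))   # 121013456
```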

How to remove any trailing numbers from a string?

Sample inputs:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number 1"
"Blah blah 123123"
"More blah 12321 123123 123132"
Expected output:
"Hi there how are you"
"What is the #1 pizza place in NYC?"
"Dominoes is number"
"Blah blah"
"More blah"
I'm thinking it's a 2 step process:
Split the entire string into characters, one row per character (including spaces), in reverse order
Loop through, and for each one if it's a space or a number, skip, otherwise add to the start of another array.
And i should end up with the desired result.
I can think of a few quick and dirty ways, but this needs to perform fairly well, as it's a trigger that runs on a busy table, so thought i'd throw it out to the T-SQL pros.
Any suggestions?
This solution should be a bit more efficient because it first checks to see if the string contains a number, then it checks to see if the string ends in a number.
CREATE FUNCTION dbo.trim_ending_numbers(@columnvalue AS VARCHAR(100)) RETURNS VARCHAR(100)
BEGIN
--This will make the query more efficient by first checking to see if it contains any numbers at all
IF @columnvalue NOT LIKE '%[0-9]%'
RETURN @columnvalue
DECLARE @counter INT
SET @counter = LEN(@columnvalue)
IF ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 0
RETURN @columnvalue
WHILE ISNUMERIC(SUBSTRING(@columnvalue,@counter,1)) = 1 OR SUBSTRING(@columnvalue,@counter,1) = ' '
BEGIN
SET @counter = @counter -1
IF @counter < 0
BREAK
END
SET @columnvalue = SUBSTRING(@columnvalue,0,@counter+1)
RETURN @columnvalue
END
If you run
SELECT dbo.trim_ending_numbers('More blah 12321 123123 123132')
It will return
'More blah'
A loop on a busy table is very unlikely to perform adequately. Use REVERSE and PATINDEX to find the first character that is not a digit, begin a SUBSTRING there, then REVERSE the result. With no loops, this will be plenty fast.
Your examples imply that you also don't want to match spaces.
DECLARE @t TABLE (s NVARCHAR(500))
INSERT INTO @t (s)
VALUES
('Hi there how are you'),('What is the #1 pizza place in NYC?'),('Dominoes is number 1'),('Blah blah 123123'),('More blah 12321 123123 123132')
select s
, reverse(s) as beginning
, patindex('%[^0-9 ]%',reverse(s)) as progress
, substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s))) as [more progress]
, reverse(substring(reverse(s),patindex('%[^0-9 ]%',reverse(s)), 1+len(s)-patindex('%[^0-9 ]%',reverse(s)))) as SOLUTION
from @t
Final answer:
reverse( substring( reverse( @s ), patindex( '%[^0-9 ]%', reverse( @s ) ), 1 + len( @s ) - patindex( '%[^0-9 ]%', reverse( @s ) ) ) )
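The reverse, find, cut, reverse sequence can be followed step by step in a short Python sketch. It is meant only to illustrate the logic, not to replace the T-SQL; the function name is hypothetical, and returning the input unchanged when it contains nothing but digits and spaces is an assumption made for this sketch.

```python
def strip_trailing_numbers(text: str) -> str:
    """Drop trailing digits and spaces by scanning the reversed string."""
    reversed_text = text[::-1]
    # Position of the first character that is neither a digit nor a space
    # (0-based here, whereas PATINDEX counts from 1).
    keep_from = next(
        (i for i, ch in enumerate(reversed_text) if ch not in "0123456789 "),
        None,
    )
    if keep_from is None:
        return text  # assumption: only digits and spaces, leave as is
    return reversed_text[keep_from:][::-1]

print(strip_trailing_numbers("More blah 12321 123123 123132"))       # More blah
print(strip_trailing_numbers("What is the #1 pizza place in NYC?"))  # unchanged
```

For the sample inputs in the question, Python's built-in `text.rstrip("0123456789 ")` gives the same results in a single call.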
I believe that the below query is fast and useful
select reverse(substring(reverse(colA),PATINDEX('%[0-9][a-z]%',reverse(colA))+1,
len(colA)-PATINDEX('%[0-9][a-z]%',reverse(colA))))
from TBLA
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat 2 sat33 on4 the mat'
--DECLARE @String VARCHAR(100) = 'the fat cat sat on the mat1'
--DECLARE @String VARCHAR(100) = '2121'
DECLARE @String VARCHAR(100) = 'the fat cat 2 2 2 2 sat on the mat2121'
DECLARE @Answer NVARCHAR(MAX),
@Index INTEGER = LEN(@String),
@Character CHAR,
@IncorrectCharacterIndex SMALLINT
-- Start from the end, going to the front.
WHILE @Index > 0 BEGIN
-- Get each character, starting from the end
SET @Character = SUBSTRING(@String, @Index, 1)
-- Regex check.
SET @IncorrectCharacterIndex = PATINDEX('%[A-Za-z-]%', @Character)
-- Is there a match? We're lucky here because it will either match on index 1 or not (index 0)
IF (@IncorrectCharacterIndex != 0)
BEGIN
-- We have a legit character.
SET @Answer = SUBSTRING(@String, 0, @Index + 1)
SET @Index = 0
END
ELSE
SET @Index = @Index - 1 -- No match, lets go back one index slot.
END
PRINT LTRIM(RTRIM(@Answer))
NOTE: I've included a dash in the valid regex match.
Thanks for all the contributions which were very helpful. To go further and extract off JUST the trailing number:
, substring(s, 2 + len(s) - patindex('%[^0-9 ]%',reverse(s)), 99) as numeric_suffix
I needed to sort on the number suffix so had to restrict the pattern to numerics and to get around numbers of different lengths sorting as text (ie I wanted 2 to sort before 19) cast the result:
,cast(substring(s, 2 + len(s) - patindex('%[^0-9]%',reverse(s)),99) as integer) as numeric_suffix
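The reason for the cast is easiest to see side by side: as text, 19 sorts before 2, while as integers the order is 2, 19. A minimal Python sketch of the same point, with hypothetical sample rows and a hypothetical helper name:

```python
import re
from typing import Optional

def numeric_suffix(text: str) -> Optional[int]:
    """Return the trailing run of digits as an integer, or None if there is none."""
    match = re.search(r"(\d+)\s*$", text)
    return int(match.group(1)) if match else None

rows = ["item 19", "item 2", "item 100"]
print(sorted(rows))                      # ['item 100', 'item 19', 'item 2']
print(sorted(rows, key=numeric_suffix))  # ['item 2', 'item 19', 'item 100']
```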

Capitalize / Capitalise first letter of every word in a string in Matlab?

What's the best way to capitalize / capitalise the first letter of every word in a string in Matlab?
i.e.
the rain in spain falls mainly on the plane
to
The Rain In Spain Falls Mainly On The Plane
So using the string
str='the rain in spain falls mainly on the plain.'
Simply use regexp replacement function in Matlab, regexprep
regexprep(str,'(\<[a-z])','${upper($1)}')
ans =
The Rain In Spain Falls Mainly On The Plain.
The \<[a-z] matches the first character of each word, which you can then convert to upper case using ${upper($1)}
This will also work using \<\w to match the character at the start of each word.
regexprep(str,'(\<\w)','${upper($1)}')
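For comparison, the same "match the first letter of each word and upper-case it" idea looks like this in Python. This is a sketch outside Matlab, with `\b` standing in for the `\<` word-start anchor; `capitalize_words` is a hypothetical name.

```python
import re

def capitalize_words(text: str) -> str:
    """Upper-case every lower-case ASCII letter that starts a word."""
    return re.sub(r"\b[a-z]", lambda m: m.group().upper(), text)

print(capitalize_words("the rain in spain falls mainly on the plain."))
# The Rain In Spain Falls Mainly On The Plain.
```

One caveat of this Python version: `\b` also matches after an apostrophe, so `don't` becomes `Don'T`.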
Since Matlab comes with Perl built in, Perl scripts can be used for complicated string or file processing tasks. So you could maybe use something like this:
[result, status] = perl('capitalize.pl','the rain in Spain falls mainly on the plane')
where capitalize.pl is a Perl script as follows:
$input = $ARGV[0];
$input =~ s/([\w']+)/\u\L$1/g;
print $input;
The perl code was taken from this Stack Overflow question.
Loads of ways:
str = 'the rain in Spain falls mainly on the plane'
spaceInd = strfind(str, ' '); % assume a word is preceded by a space
startWordInd = spaceInd+1; % words start 1 char after a space
startWordInd = [1, startWordInd]; % manually add the first word
capsStr = upper(str);
newStr = str;
newStr(startWordInd) = capsStr(startWordInd)
More elegant/complex -- cell-arrays, textscan and cellfun are very useful for this kind of thing:
str = 'the rain in Spain falls mainly on the plane'
function newStr = capitals(str)
words = textscan(str,'%s','delimiter',' '); % assume a word is preceded by a space
words = words{1};
newWords = cellfun(@my_fun_that_capitalizes, words, 'UniformOutput', false);
newStr = [newWords{:}];
function wOut = my_fun_that_capitalizes(wIn)
wOut = [wIn ' ']; % add the space back that we used to split upon
if numel(wIn)>1
wOut(1) = upper(wIn(1));
end
end
end
str='the rain in spain falls mainly on the plain.' ;
for i=1:length(str)
if str(i)>='a' && str(i)<='z'
if i==1 || str(i-1)==' '
str(i)=char(str(i)-32); % 32 is the ascii distance between uppercase letters and its lowercase equivalents
end
end
end
Less elegant and efficient, more readable and maintainable.
