I have 7-digit numbers in a string variable, but they appear at a different column position in each record. Below is an example of my data:
ser string
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
.....
.....
My requirement is to create a new variable containing just the number from the string variable. The output should look like this:
ser number
101 1013456
102 2017685
103 1897654
I have the list of numbers, which can be created as a macro variable:
%let num=1013456,2017685,1897654;
I have used the scan and substr functions but didn't get the desired result.
I would appreciate any solution for this. Thanks
Try using the compress() function to remove the unwanted characters and the input() function to convert to a number.
data want;
set have;
number = input(compress(string,':','as'),7.);
drop string;
run;
The second argument to compress explicitly removes the : character. The third argument, 'as', is a list of modifiers: a adds alphabetic characters and s adds space characters to the set of characters to remove.
You can use a regular expression to extract the numbers. Check out the PRX functions in SAS.
Here's an example of how to accomplish your goal using a regular expression:
data inData;
length ser 8 string $100;
ser = 101;
string = 'purchase items id: 1013456';
output;
ser = 102;
string = 'entry no: 2017685';
output;
ser = 103;
string = 'id: 1897654 item';
output;
run;
data outData;
length ser 8 number $7;
retain re;
set inData;
if _n_ = 1 then do;
re = prxparse("/.*(\d{7}).*/");
end;
if prxmatch(re, string) then do;
number = prxposn(re, 1, string);
end;
keep ser number;
run;
A slightly simpler regular expression approach using only prxmatch:
data have;
input ser string $50.;
cards;
101 purchase items id: 1013456
102 entry no: 2017685
103 id: 1897654 item
;
run;
data want;
set have;
num = input(substr(string,prxmatch('/\d{7}/',string),7),8.);
run;
This will only match the first 7-digit number in string. By contrast, if it contains any other numbers then the compress approach will concatenate all of them.
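To make the difference between the two approaches concrete, here is a minimal sketch in Python rather than SAS. The function names and the sample record are made up for illustration; the regular expression and the removed character set are meant to mirror the prxmatch and compress calls above.

```python
import re

def first_seven_digits(text):
    # Like prxmatch + substr: take the first run of 7 consecutive digits, if any.
    match = re.search(r"\d{7}", text)
    return match.group() if match else None

def strip_letters_spaces_colons(text):
    # Like compress(string, ':', 'as'): drop colons, letters and whitespace,
    # keeping whatever else is left (all digits end up side by side).
    return re.sub(r"[:A-Za-z\s]", "", text)

record = "order 42 id: 1013456"  # hypothetical record containing a second number
print(first_seven_digits(record))           # 1013456
print(strip_letters_spaces_colons(record))  # 421013456
```

With a record like this, the compress-style result is no longer the 7-digit id, which is the risk described above.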
I need to replace a string in a SAS dataset in the following way :
OTTAWA ON should be replaced with OTTAWA ONTARIO
WHATEVER QC should be replaced with WHATEVER QUEBEC
etc.
However, HOUSE ON THE HILL should not become HOUSE ONTARIO THE HILL.
That is, I want to replace ON with ONTARIO, but only if ON is the last word in the string.
You could use regular expressions to do this. From what you have described, I think the following should work.
myString = prxchange("s/(.*)( ON)$/$1 ONTARIO/",-1,strip(myString));
myString = prxchange("s/(.*)( QC)$/$1 QUEBEC/",-1,strip(myString));
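The same idea, sketched in Python for readers who want to try the pattern outside SAS. The function names are hypothetical. The point is the `$` anchor, plus the need to remove trailing blanks first, which is what strip() does in the SAS call, since SAS pads character variables with blanks.

```python
import re

def expand_trailing_on(text):
    # Remove trailing blanks first, then replace " ON" only at the very end.
    return re.sub(r" ON$", " ONTARIO", text.rstrip())

def expand_without_strip(text):
    # Same pattern without stripping: trailing blanks stop the anchor from matching.
    return re.sub(r" ON$", " ONTARIO", text)

print(expand_trailing_on("OTTAWA ON"))          # OTTAWA ONTARIO
print(expand_trailing_on("HOUSE ON THE HILL"))  # HOUSE ON THE HILL
print(expand_without_strip("OTTAWA ON   "))     # unchanged: OTTAWA ON followed by blanks
```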
Use a separate control data set to maintain the substitutions (postal code -> province) you want.
Load the control data into a hash.
Process the data, scanning out the last 'word'.
If the word is a key in the hash, replace the word with the province value.
Presuming you only transform a 'token' (a Canadian postal abbreviation) that appears as the final 'word', an example of the control data, the data and the transformation is as follows:
data O_Canada(label="Our home and native land");
length postal $2 province $26 ;
input postal& province&; * suffix & means data fields separated by >1 space;
datalines;
ON  Ontario
QC  Quebec
NS  Nova Scotia
NB  New Brunswick
MB  Manitoba
BC  British Columbia
PE  Prince Edward Island
SK  Saskatchewan
AB  Alberta
NL  Newfoundland and Labrador
;
data cities(label='Some popular places');
length place $100;
input place $CHAR50.;
datalines;
CALGARY AB
VANCOUVER BC
WINNIPEG MB
MONCTON NB
ST. JOHN'S NL
HALIFAX NS
TORONTO ON
MONTREAL QC
SASKATOON SK
CHARLOTTETOWN PE
WHITEHORSE YT
YELLOWKNIFE NT
IQALUIT NU
GOLDMINE YUKON
;
data cities;
modify cities;
if _n_ = 1 then do;
length postal $3 province $26; * postal 1 bigger so scanned postal will not always match;
declare hash provinces(dataset:'O_Canada');
provinces.defineKey('postal');
provinces.defineData('province');
provinces.defineDone();
call missing(postal, province);
drop postal province;
end;
postal = scan(place,-1,' ');
if provinces.find() eq 0 then do;
* this inline replacement presumes all postal codes are 2 characters;
* -1 from length will replace starting from found postal;
substr(place,length(place)-1) = province; * inline replacement;
replace;
end;
run;
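For comparison, the lookup-table idea behind the hash step can be sketched in a few lines of Python. This is only an illustration: the dictionary holds a subset of the control data above, and the function name is hypothetical.

```python
# Subset of the control data: postal abbreviation -> province name.
PROVINCES = {"ON": "Ontario", "QC": "Quebec", "NS": "Nova Scotia"}

def expand_last_word(place):
    # Split off the last blank-delimited word, as scan(place, -1, ' ') does.
    head, sep, last = place.rstrip().rpartition(" ")
    province = PROVINCES.get(last)
    if province is None or not sep:
        return place  # last word is not a known abbreviation: leave unchanged
    return f"{head} {province}"

print(expand_last_word("TORONTO ON"))      # TORONTO Ontario
print(expand_last_word("WHITEHORSE YT"))   # WHITEHORSE YT
print(expand_last_word("GOLDMINE YUKON"))  # GOLDMINE YUKON
```

Keeping the substitutions in a table means that adding a province is a data change, not a code change, which is the main attraction of the control data set.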
data GEOGRAPHY;
infile datalines truncover;
informat geo $2. graphy $32.;
input geo $ graphy $;
datalines;
ON ONTARIO
QC QUEBEC
;
proc sql;
select whatever_you_want,
case graphy
when '' then myString
else substr(myString, 1, length(myString) - length(geo)) || graphy
end as myString
from HAVE left join GEOGRAPHY on scan(myString, -1) eq geo;
quit;
scan(myString, -1) returns the last word in myString and trim(myString) removes trailing blanks, so in a data step this does the job:
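cutString = substr(myString, 1, length(myString) - 2); * everything before the 2-character code;
select (scan(myString, -1));
when ('ON') myString = catx(' ', cutString, 'ONTARIO');
when ('QC') myString = catx(' ', cutString, 'QUEBEC');
otherwise; * leave all other strings unchanged;
end;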
or in SQL
select case scan(myString, -1)
when 'ON' then trim(myString) || 'TARIO'
when 'QC' then substr(myString, 1, length(myString) - 2) || 'QUEBEC'
else myString end as myString
from YOU_KNOW_BETTER_THAN_I_DO;
@Sonny, I think the regular expression is very good. And @astel, here is another way that is easy to understand:
data test;
InText = 'HOUSE ON THE HILL';
output;
InText = 'OTTAWA ON';
output;
run;
data _null_;
set test;
if cats(reverse(InText)) =: 'NO ' then OutText = tranwrd(InText,' ON',' ONTARIO');
put InText= @30 OutText=;
run;
The output will be
InText=HOUSE ON THE HILL OutText=
InText=OTTAWA ON OutText=OTTAWA ONTARIO
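Reverse the variable so you can easily check whether the reversed value starts with 'NO ', which means the original value ends with ' ON'. Then do the replacement with the tranwrd() function. Note that tranwrd() replaces every occurrence of ' ON' in the string, not only the last one.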
How can I, in ABAP, split a string into n parts AND determine which one is the biggest element? In my solution I would need to know how many elements there are, but I want to solve it for WHATEVER NUMBER of elements.
I tried the code below and searched the web.
DATA: string TYPE string VALUE 'this is a string'.
DATA: part1 TYPE c LENGTH 20.
DATA: part2 TYPE c LENGTH 20.
DATA: part3 TYPE c LENGTH 20.
DATA: part4 TYPE c LENGTH 20.
DATA: del TYPE c VALUE ' '.
DATA: bigger TYPE c LENGTH 20.
split: string AT del INTO part1 part2 part3 part4.
bigger = part1.
IF bigger > part2.
bigger = part1.
ELSEIF bigger > part3.
bigger = part2.
ELSE.
bigger = part4.
ENDIF.
WRITE: bigger.
Expected result: Works with any number of elements in a string and determines which one is biggest.
Actual result: I need to know how many elements there are
Here is one way to solve it:
DATA: string TYPE string VALUE 'this is a string'.
TYPES: BEGIN OF ty_words,
word TYPE string,
length TYPE i,
END OF ty_words.
DATA: ls_words TYPE ty_words.
DATA: gt_words TYPE STANDARD TABLE OF ty_words.
START-OF-SELECTION.
WHILE string IS NOT INITIAL.
SPLIT string AT space INTO ls_words-word string.
ls_words-length = strlen( ls_words-word ).
APPEND ls_words TO gt_words.
ENDWHILE.
SORT gt_words BY length DESCENDING.
READ TABLE gt_words
ASSIGNING FIELD-SYMBOL(<ls_longest_word>)
INDEX 1.
IF sy-subrc EQ 0.
WRITE: 'The longest word is:', <ls_longest_word>-word.
ENDIF.
Please note that this does not cover the case where several words share the longest length; it will show just one of them.
You don't need to know the number of parts if you split the string into an array (an internal table). Then you LOOP over the array and check the string length to find the longest one.
While József Szikszai's solution works, it may be more complex than what you need. This would work just as well (with the same limitation that it will only output the first longest word and no others of the same length):
DATA string TYPE string VALUE 'this is a string'.
DATA parts TYPE STANDARD TABLE OF string.
DATA biggest TYPE string.
FIELD-SYMBOLS <part> TYPE string.
SPLIT string AT space INTO TABLE parts.
LOOP AT parts ASSIGNING <part>.
IF STRLEN( <part> ) > STRLEN( biggest ).
biggest = <part>.
ENDIF.
ENDLOOP.
WRITE biggest.
Edit: I assumed 'biggest' meant longest, but if you actually wanted the word that comes last in alphabetical order, then you could sort the array descending and just output the first entry like this:
DATA string TYPE string VALUE 'this is a string'.
DATA parts TYPE STANDARD TABLE OF string.
DATA biggest TYPE string.
SPLIT string AT space INTO TABLE parts.
SORT parts DESCENDING.
READ TABLE parts INDEX 1 INTO biggest.
WRITE biggest.
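The two readings of "biggest" give different answers for the sample string, which a short Python sketch makes easy to check. It is not ABAP, the variable names are illustrative, and Python compares strings by code point, which may not match the sort order of a given ABAP system. It also shows one way to collect all words that tie for the longest length.

```python
def longest_words(text):
    # Split on whitespace, then keep every word whose length equals the maximum.
    words = text.split()
    if not words:
        return []
    max_len = max(len(word) for word in words)
    return [word for word in words if len(word) == max_len]

text = "this is a string"
words = text.split()

longest = max(words, key=len)  # first word with the greatest length
last_alphabetical = max(words)  # greatest word in code-point order

print(longest)                       # string
print(last_alphabetical)             # this
print(longest_words("one two six"))  # ['one', 'two', 'six']
```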
With ABAP 7.40, you can also shorten it to:
SPLIT lv_s AT space INTO TABLE DATA(lt_word).
DATA(lv_longest) = REDUCE string( INIT longest = `` FOR <word> IN lt_word NEXT longest = COND #( WHEN strlen( <word> ) > strlen( longest ) THEN <word> ELSE longest ) ).
DATA(lv_alphabetic) = REDUCE string( INIT alph = `` FOR <word> IN lt_word NEXT alph = COND #( WHEN <word> > alph THEN <word> ELSE alph ) ).
If "biggest" means "longest" word here is the Regex way to do this:
FIND ALL OCCURRENCES OF REGEX '\w+' IN string RESULTS DATA(words).
SORT words BY length DESCENDING.
WRITE substring( val = string off = words[ 1 ]-offset len = words[ 1 ]-length ).
I want to retrieve specific text within a column in a SAS file.
The file would look like the following:
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
I want to retrieve just the information after the colon for Admission Code and Exit Code and put them in their own columns. The 'codes' can be any combination of letters, numbers, and blank spaces. The new data would look like the following:
Patient Location AdmissionCode ExitCode
001 B 123456 X 98765W
002 C 4567 WY 76543Z
003 D 67890 L 4321Z
I'm not familiar with the functions in SAS, but maybe the logic would look something like the following:
data want;
set have;
do i = 1 to dim(infoTxt)
AdmissionCode = substring(string1, regexpr(":", string) + 1);
ExitCode = substring(string2, regexpr(":", string) + 1);
run;
In the code above, string1 would represent the first line of text in infoTxt and string2 would represent the second line of text infoTxt.
SAS can utilize Perl regular expressions through the family of functions that start with PRX. The tip sheet is a great summary if you are familiar with regular expressions.
PRXMATCH and PRXPOSN can test a regex pattern with capture groups and retrieve the group text.
data have;
input;
text = _infile_;
datalines;
Admission Code: 123456 X Exit Code: 98765W
Admission Code: 4567 WY Exit Code: 76543Z
Admission Code: 67890 L Exit Code: 4321Z
run;
data want;
set have;
if _n_ = 1 then do;
retain rx;
rx = prxparse ('/Admission Code: (.*)Exit Code:(.*)/');
end;
length AdmissionCode ExitCode $50;
if prxmatch(rx,text) then do;
AdmissionCode = prxposn(rx, 1, text);
ExitCode = prxposn(rx, 2, text);
end;
drop rx;
run;
I like a RegEX with a capture buffer as much as the next guy but you could also use input statement features to read this data.
data info;
infile cards n=2 firstobs=2;
input #1 patient:$3. location :$1. @'Admission Code: ' AdmissionCode &$16. #2 @'Exit Code: ' ExitCode &$16.;
cards;
Patient Location infoTxt
001 B Admission Code: 123456 X
Exit Code: 98765W
002 C Admission Code: 4567 WY
Exit Code: 76543Z
003 D Admission Code: 67890 L
Exit Code: 4321Z
;;;;
run;
proc print;
run;
There may be a solution out there that does it all in one data step. This one uses two steps to deal with the admission and exit codes being on different rows: first a data step, then a join to put them back together.
SAS does have regex support, but I used SAS character functions instead. substr takes three arguments: the string, the start position, and the length. The length is optional, and I've omitted it so that substr returns everything from the start position onward. retain is used to carry the patient and location over to the second row of each group.
data admission exit;
set grep;
retain patient2 location2;
if patient ne '' then do;
patient2=patient;
location2=location;
admissioncode=substr(infoTxt,find(infoTxt,":")+2);
output admission;
end;
else do;
exitcode=substr(infoTxt,find(infoTxt,":")+2);
output exit;
end;
run;
proc sql;
create table dat as select a.patient2 as patient,a.location2 as location,a.admissioncode,b.exitcode
from admission a
left join exit b on a.patient2=b.patient2 and a.location2=b.location2
;
quit;
Provided that you always have the same pattern of colons and line breaks, I think you can do this with scan:
admission_code = scan(infoTxt, 2, '3A0A0D'x);
exit_code = scan(infoTxt, 4, '3A0A0D'x);
This uses the hex literal '3A0A0D'x to specify :, line feed and carriage return as delimiters for the scan function.
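As a rough illustration of what scan does with that delimiter list, here is a small Python emulation. It assumes, as the answer does, that infoTxt holds both lines separated by a carriage return and line feed. The helper name `scan` and the sample value are only for demonstration, and the helper covers just positive word numbers.

```python
import re

def scan(text, n, delimiters):
    # Rough emulation of SAS scan for n >= 1: runs of delimiters count as one
    # separator, and empty pieces at the start or end are ignored.
    pattern = "[" + re.escape(delimiters) + "]+"
    words = [word for word in re.split(pattern, text) if word]
    return words[n - 1] if 1 <= n <= len(words) else ""

info_txt = "Admission Code: 123456 X\r\nExit Code: 98765W"  # assumed layout
admission_code = scan(info_txt, 2, ":\r\n").strip()
exit_code = scan(info_txt, 4, ":\r\n").strip()

print(admission_code)  # 123456 X
print(exit_code)       # 98765W
```

The pieces keep the blank that follows each colon, so the sketch strips them; in SAS the left or strip function would do the same.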
I have a character string array in Fortran as ' results: CI- Energies --- th= 89 ph=120'. How do I extract the characters '120' from the string and store into a real variable?
The string is written in the file 'input.DAT'. I have written the Fortran code as:
implicit real*8(a-h,o-z)
character(39) line
open(1,file='input.DAT',status='old')
read(1,'(A)') line,phi
write(*,'(A)') line
write(*,*)phi
end
Upon execution it shows:
At line 5 of file string.f (unit = 1, file = 'input.dat')
Fortran runtime error: End of file
I have given '39' as the length of the character variable, as there are 39 characters including spaces in the string up to '120'.
Assuming that the real number you want to read appears after the last equal sign in the string, you can use the SCAN intrinsic function to find that location and then READ the number from the rest of the string, as shown in the following program.
program xreadnum
implicit none
integer :: ipos
integer, parameter :: nlen = 100
character (len=nlen) :: str
real :: xx
str = "results: CI- Energies --- th= 89 ph=120"
ipos = scan(str, "=", back=.true.)
print*, "reading real variable from '" // trim(str(1+ipos:)) // "'"
read (str(1+ipos:),*) xx
print*, "xx = ", xx
end program xreadnum
! gfortran output:
! reading real variable from '120'
! xx = 120.000000
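The logic of the program above (find the last equal sign, then convert whatever follows it) can be checked quickly in Python. This is only a sketch of the same idea, with a hypothetical function name.

```python
def number_after_last_equals(line):
    # Locate the last '=' (like scan(str, "=", back=.true.)) and convert the rest.
    pos = line.rfind("=")
    if pos == -1:
        raise ValueError("no '=' found in line")
    return float(line[pos + 1:])

line = " results: CI- Energies --- th= 89 ph=120"
print(number_after_last_equals(line))  # 120.0
```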
To convert string s into a real type variable r:
READ(s, "(Fw.d)") r
Here w is the total field width and d is the number of digits after the decimal point. If there is no decimal point in the input string, values of w and d might affect the result, e.g.
s = '120'
READ(s, "(F3.0)") r ! r <-- 120.0
READ(s, "(F3.1)") r ! r <-- 12.0
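The effect of d on input without a decimal point can be imitated in a few lines of Python. This is a simplified sketch of the rule, not a full emulation of Fortran F editing: it ignores exponents and blank handling, and the function name `read_f` is made up.

```python
def read_f(field, w, d):
    # Take the first w characters of the field.
    text = field[:w].strip()
    if "." in text:
        # An explicit decimal point overrides d.
        return float(text)
    # No decimal point: the last d digits are treated as the fractional part.
    return int(text) / 10 ** d

print(read_f("120", 3, 0))  # 120.0
print(read_f("120", 3, 1))  # 12.0
print(read_f("1.5", 3, 1))  # 1.5
```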
The answer to the other part of the question (how to extract the substring that holds the number to convert) depends strongly on the format of the input strings. For example, if all the strings consist of fixed-width fields, you can skip the irrelevant part of the string:
s = 'a=120'
READ(s(3:), "(F3.0)") r
I am a Java developer and new to Matlab. I have a file that looks something like this:
Label_X sdfasf sadfl asdf a fasdlkjf asd
Label_Y lmdfgl ldfkgldkj dkljdkljdlkjdklj
Label_X sfdsa sdfsafasfsafasf 234|3#ert 44
Label_X sdfsfdsf____asdfsadf _ dsfsd
Label_Y !^dfskşfsşk o o o o 4545
What I want is:
A vector (array) includes labels:
Label Array:
Label_X
Label_Y
Label_X
Label_X
Label_Y
and a list (with five elements in our example) in which every element is itself a list of the delimited strings from that line. I mean:
Element Number Value(List of strings) Element size of value list
-------------- ---------------------- --------------------------
1 sdfasf,sadfl,asdf,a,fasdlkjf,asd 6
2 lmdfgl,ldfkgldkj,dkljdkljdlkjdklj 3
3 sfdsa,sdfsafasfsafasf,234|3#ert,44 4
4 sdfsfdsf____asdfsadf,_,dsfsd 3
5 !^dfskşfsşk,o,o,o,o,4545 6
I know it is pretty simple with Java but I don't know how to implement it in Matlab.
PS: Here is what I am doing. I have a text file that contains people's tweets. The first word in each row is the label, and the other words are the words related to that label. I want a list of labels and a list of lists that holds the words for each label.
This probably isn't optimal, but it should do the trick
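lines = textread('test.txt', '%s', 'delimiter', '\n', 'whitespace', ''); % one cell per line
List = cell(size(lines));
for i = 1:numel(lines)
[List{i}.name, remain] = strtok(lines{i}, ' '); % first word is the label
[List{i}.content, remain] = strtok(remain, ' '); % second word starts the content
j = double(~isempty(List{i}.content)); % number of words collected so far
while ~isempty(strtrim(remain))
[temp, remain] = strtok(remain, ' ');
List{i}.content = [List{i}.content ',' temp]; % append the next word
j = j+1;
end
List{i}.size = j;
end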
The best construct for this in Matlab is the cell array. Each cell can hold one object of any type, and cells are usually collected in arrays. Something like this should work and be reasonably efficient (assuming you don't expect more than 10K lines):
output=cell(10000,1); %This should be set to the maximum number of lines you ever expect to have
output_names=cell(size(output));
output_used=false(size(output));
fid=fopen('filename.txt','r');
index=0;
while ~feof(fid)
index=index+1;
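line=fgetl(fid); % fgetl drops the trailing newline
split_names=regexp(strtrim(line),'\s+','split'); % split on runs of whitespace
output{index}=split_names(2:end); % words after the label
output_names{index}=split_names{1}; % the label itself
output_used(index)=true;
end
fclose(fid);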
output=output(output_used);
output_names=output_names(output_used);
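For readers who want to check the expected result independently of Matlab, the target structure (one list of labels plus one list of word lists) can be sketched in Python. The function name is hypothetical and the sample lines are taken from the question.

```python
def parse_labelled_lines(lines):
    # Return (labels, words): the first token of each line and the remaining tokens.
    labels, words = [], []
    for line in lines:
        parts = line.split()  # split on runs of whitespace
        if not parts:
            continue  # skip blank lines
        labels.append(parts[0])
        words.append(parts[1:])
    return labels, words

sample = [
    "Label_X sdfasf sadfl asdf a fasdlkjf asd",
    "Label_Y lmdfgl ldfkgldkj dkljdkljdlkjdklj",
]
labels, words = parse_labelled_lines(sample)
print(labels)                   # ['Label_X', 'Label_Y']
print([len(w) for w in words])  # [6, 3]
```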