I am having trouble figuring out how to extract specific text within a string. My dataset has been pulled from de-identified electronic health records, and contains a list of every medication that our patients have been prescribed. I am, however, only concerned with a specific list of medications, which I have in another table. Within each cell is the name of the medication, dose, and form (Tabs, Caps, etc.) [See image]. Much of this information is not important for my analysis though, and I only need to extract the medication names that match my list. It might also be useful to extract the first word from each string, as it is (in most cases) the name of the medication.
I have examined a number of different methods of pulling substrings, but haven't quite found something that meets my needs. Any help would be greatly appreciated.
Thanks.
Data DRUGS;
infile datalines flowover;
length drug1-drug69 $20;
array drug[69];
input (drug1-drug69)($);
datalines;
AMITRIPTYLINE
AMOXAPINE
BUPROPION
CITALOPRAM
CLOMIPRAMINE
DESIPRAMINE
DOXEPIN
ESCITALOPRAM
FLUOXETINE
FLUVOXAMINE
IMIPRAMINE
ISOCARBOXAZID
MAPROTILINE
MIRTAZAPINE
NEFAZODONE
NORTRIPTYLINE
PAROXETINE
PHENELZINE
PROTRIPTYLINE
SERTRALINE
TRANYLCYPROMINE
TRAZODONE
TRIMIPRAMINE
VENLAFAXINE
AMITRIP
ELEVIL
ENDEP
LEVATE
ADISEN
AMOLIFE
AMOXAN
AMOXAPINE
DEFANYL
OXAMINE
OXCAP
WELLBUTRIN
BUPROBAN
APLENZIN
BUDEPRION
ZYBAN
CELEXA
ANAFRANIL
NORPRAMIN
SILENOR
PRUDOXIN
ZONALON
LEXAPRO
PROZAC
SARAFEM
LUVOX
TOFRANIL
TOFRANIL-PM
MARPLAN
LUDIOMIL
REMERON
REMERONSOLTAB
PAMELOR
PAXIL
PEXEVA
BRISDELLE
NARDIL
VIVACTIL
ZOLOFT
PARNATE
OLEPTRO
SURMONTIL
EFFEXOR
DESVENLAFAXINE
PRISTIQ
;;;;
run;
Data DM4_;
if _n_=1 then set DRUGS;
array drug[69];
set DM4;
do _i = 1 to countw(Description,' ().,');
_med = scan(Description,_i,' ().,');
_whichmed = whichc(_med, of drug[*]);
if _whichmed > 0 then leave;
end;
run;
Data DM_Meds (drop = drug1-drug69 _i _med _whichmed);
Set DM4_;
IF _whichmed > 0 then anti = _med;
else anti = ' ';
run;
This is a fairly common problem with a bunch of possible solutions depending on your needs.
The simplest answer is to create an array, assuming you have a smallish number of medicines. This isn't necessarily the fastest solution, but it works fairly well and is simple to construct. Get your drug list into a dataset, transpose it to horizontal (one row with lots of meds), then load it up this way. You iterate over the words in the name of the medicine and check whether any of them is in the medicine list; if one is, then bingo, you have your drug! In real use, of course, drop the drug: variables afterwards.
This works a bit better than the inverse (searching each drug to see if it's in the medicine name) since usually there are more words in the drug list than in the medicine name. The hash solution might be faster, if you're comfortable with hashes (load the drug list into a hash table then use find() to do the same as what whichc is doing here).
data have;
input #1 medname $50.;
datalines;
PROVIGIL OR
ENSURE HIGH PROTEIN OR LIQD
BENADRYL 25 MG OR CAPS
ECOTRIN LOW STRENGTH 81 MG OR TBEC
SPIRONOLACTONE 25 MG PO TABS
NORVASC 5 MG OR TABS
FLUOXETINE HCL 25MG
IBUPROFEN 200MG
NEFAZODONE TABS OR CAPS 20MG
PAXIL (PAROXETINE HCL) 25MG
;;;;
run;
data drugs;
infile datalines flowover;
length drug1-drug19 $20;
array drug[19];
input (drug1-drug19) ($);
datalines;
AMITRIPTYLINE
AMOXAPINE
BUPROPION
CITALOPRAM
CLOMIPRAMINE
DESIPRAMINE
DOXEPIN
ESCITALOPRAM
FLUOXETINE
FLUVOXAMINE
IMIPRAMINE
ISOCARBOXAZID
MAPROTILINE
MIRTAZAPINE
NEFAZODONE
NORTRIPTYLINE
PAROXETINE
PHENELZINE
PROTRIPTYLINE
;;;;
run;
data want;
if _n_ = 1 then set drugs;
array drug[19];
set have;
do _i = 1 to countw(medname,' ().,');
_medword = scan(medname,_i,' ().,');
_whichmed = whichc(_medword, of drug[*]);
if _whichmed > 0 then leave;
end;
run;
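The hash alternative mentioned above is not shown in the SAS code, so here is a minimal sketch of the same idea in Python rather than SAS, purely for illustration. The three-drug set and the name `find_drug` are hypothetical, and the upper-casing step is an addition (whichc compares case-sensitively).

```python
import re

# Hypothetical, shortened drug list; in practice it would be loaded from the drug table.
DRUGS = {"FLUOXETINE", "NEFAZODONE", "PAROXETINE"}

# Same delimiters as the scan() call above: space, parentheses, period, comma.
DELIMS = re.compile(r"[ ().,]+")

def find_drug(medname):
    """Return the first word of medname that is in DRUGS, or '' if none is."""
    for word in DELIMS.split(medname.upper()):  # upper() is an assumption, not in the SAS code
        if word in DRUGS:  # set membership is a hash lookup, like a hash find()
            return word
    return ""

for name in ["BENADRYL 25 MG OR CAPS", "FLUOXETINE HCL 25MG", "PAXIL (PAROXETINE HCL) 25MG"]:
    print(name, "->", find_drug(name))
```

The loop stops at the first matching word, which mirrors the `leave` statement in the data step.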
This should be an easy task for PROC SQL.
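Let's say you have patient information in table A and drug names in table B (long format, one drug per row, not the wide format you gave). The code below joins A to B wherever the description in A contains a drug name in B and writes the result to table C. Because it is a LEFT JOIN, rows of A with no matching drug are kept with a missing drug; use an INNER JOIN instead if you only want the matching rows.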
PROC SQL;
CREATE TABLE C AS SELECT DISTINCT *
FROM A LEFT JOIN B
ON UPCASE(A.description) CONTAINS UPCASE(B.drug);
QUIT;
In my dataset, the last name (lname) occasionally has the generational suffix attached. Regarding the generational suffix:
there are no spaces or other possible delimiters between the lname variable and the suffix
the suffix ranges between 2 and 4 characters in length
the suffix is a mix of lowercase, uppercase, and proper case
the suffix sometimes includes a combination of integers and characters
I tried to think of simple solutions first. I couldn't find one in Excel, because all of its string functions require the values to be removed to sit at a consistent position.
In SAS, PARSE requires a delimiter, and TRIM requires a consistent position.
In the syntax I've attached are four different approaches I tried. None of them were successful, and I totally admit user error. I'm not familiar with any of them other than COMPRESS, and then only for removing blanks.
Is there a way I can make a new variable for last name that doesn't have the generational suffix attached?
Thank you so much!
This first piece applies to each of my attempts.
data have;
input id lname $ fname $;
datalines;
123456 Smith John
234567 SMITH ANDREW
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
;
run;
My attempts start here.
/* COMPRESS */
data want;
set have;
lname2 = compress(lname,'Jr');
put string=;
run;
/* TRANWRD */
data want;
set have;
lname2 = tranwrd(lname,"Jr", "");
lname2 = tranwrd(lname,"Sr", "");
lname2 = tranwrd(lname,"III", "");
run;
/* PRXCHANGE */
data want;
set have;
lname2 = lname;
lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/i',1,trim(lname));
run;
/* PRXMATCH */
data want;
set have;
if prxmatch('/Jr|Sr|III/',lname) then lname2 = '';
run;
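You cannot use compress() for this purpose at all: it removes individual characters wherever they occur, not a substring.
tranwrd replaces every occurrence of a substring and is case-sensitive, so on its own it cannot restrict the replacement to the end of the name; a pattern at the beginning or in the middle of the name would be removed as well. translate has the same limitation, since it works character by character.
An example using prxmatch is below.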
data have;
input id lname $ fname $;
datalines;
123456 Smith John
234567 SMITH ANDREW
345678 SmithJr Alan
456789 SMITHSR SAM
789012 smithiii robert
890123 smithIIII william
901234 Smith4th Tim
901235 SRith4th Tim
;
run;
data want;
set have;
/* Use PRXPARSE to compile the Perl regular expression. */
patternID=prxparse('/(JR$)|(SR$)|(III$)/');
/* Use PRXMATCH to find the position of the pattern match. */
position=prxmatch(patternID, compress(upcase(lname)));
put position=;
if position then do;
put lname=;
lname2 = '';
end;
run;
I think you're fine with your prxchange method; for me it's the most reliable and the easiest to maintain. I would just change two things:
Use the 'o' modifier so that the regex is compiled only once
Use strip instead of trim (strip is the equivalent of ltrim + rtrim: it removes both leading and trailing blanks)
data want;
set have;
attrib lname2 format=$50.;
lname2 = prxchange('s/(.*)(jr|sr|iii|iv)$/$1/oi', 1, strip(lname));
run;
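Note that the pattern above only lists jr, sr, iii and iv, so a value such as smithIIII keeps one trailing I and Smith4th is left unchanged. A minimal sketch of a wider pattern, written in Python for illustration rather than SAS; the suffix list (including the ordinal forms) and the name `strip_suffix` are assumptions, not part of the original answer.

```python
import re

# Assumed suffix list: Jr, Sr, roman II to IIII, IV, and ordinals such as 4th.
# The trailing $ anchors the match to the end of the name; IGNORECASE plays the role of /i.
SUFFIX = re.compile(r"(jr|sr|i{2,4}|iv|\d+(st|nd|rd|th))$", re.IGNORECASE)

def strip_suffix(lname):
    """Remove one generational suffix from the end of lname, if present."""
    return SUFFIX.sub("", lname.strip(), count=1)

for lname in ["Smith", "SmithJr", "SMITHSR", "smithiii", "smithIIII", "Smith4th", "SRith4th"]:
    print(lname, "->", strip_suffix(lname))
```

Because the pattern is anchored at the end, the SR at the start of SRith4th is untouched. Any approach of this kind will also truncate real surnames that happen to end in one of the listed suffixes, so the result is worth checking by eye.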
I'm trying to find the number of words in this table:
Download Table here: http://www.mediafire.com/file/m81vtdo6bdd7bw8/Table_RandomInfoMiddle.mat/file
Words are indicated by the "Type" criterion being "letters". The key thing to notice is that not everything in the table is a word, and that the entry "<missing>" registers as a word. In other words, I need to determine the number of words by counting only the "letters" entries, except where the word itself is "<missing>".
Here is my attempt (so far unsuccessful; note the two lines marked "Problem area"):
for col=1:size(Table_RandomInfoMiddle,2)
column_name = sprintf('Words count for column %d',col);
MiddleWordsType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'letters'}));
MiddleWordsExclusionType_table.(column_name) = nnz(ismember(Table_RandomInfoMiddle(:,col).Variables,{'<missing>'})); %Problem area
end
%Call data from table
MiddleWordsType = table2array(MiddleWordsType_table);
MiddleWordsExclusionType = table2array(MiddleWordsExclusionType_table); %Problem area
%Take out zeros where "Type" was
MiddleWordsTotal_Nr = MiddleWordsType(MiddleWordsType~=0);
MiddleWordsExclusionTotal_Nr = MiddleWordsExclusionType(MiddleWordsExclusionType~=0);
%Final answer
FinalMiddleWordsTotal_Nr = MiddleWordsTotal_Nr-MiddleWordsExclusionTotal_Nr;
Any help will be appreciated. Thank you!
You can get the unique values from column 1 when column 2 satisfies some condition using
MiddleWordsType = numel( unique( ...
Table_RandomInfoMiddle{ismember(Table_RandomInfoMiddle{:,2}, 'letters'), 1} ) );
<missing> is a keyword in a categorical array, not literally the string "<missing>". That's why it appears blue and italicised in the workspace. If you want to check specifically for missing values, you can use this instead of ismember:
ismissing( Table_RandomInfoMiddle{:,1} )
I need to extract the string after the numbers. The problem is that the number of digits at the front of the string is inconsistent. What I need is something similar to Flash Fill in Excel, but I'll be doing it for 100K+ rows, so Excel might not be able to handle the data. For example:
12345678aaa#mail.com
12345bbb#mail.com
123456789ccc#mail.com
I want to create another variable with the extracted string, such as the following:
aaa#mail.com
bbb#mail.com
ccc#mail.com
Is this possible?
Thank you in advance!
You can use regular expression substitution (PRXCHANGE), or a careful use of the VERIFY function.
Example:
data have;
input email $char25.; datalines;
12345678aaa#mail.com
12345bbb#mail.com
123456789ccc#mail.com
1234567890123456789012345
;
data want;
set have;
mail1 = prxchange('s/^\d+//',-1,email);
if email in: ('0','1','2','3','4','5','6','7','8','9') then
mail2 = substr(email||' ',verify (email||' ', '0123456789'));
run;
The example above should be OK, but assuming that some email addresses could contain numbers after the leading digits, 123abc001#mail.com for instance, my code below should help:
data have;
input email $char25.; datalines;
12345678abc01#mail.com
12345bcde#mail.com
123456789cdefg1#mail.com
;
run;
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_HAVE_0003 AS
SELECT t1.email,
/* want */
(substrn(t1.email,INDEXC( t1.email, SUBSTRN(COMPRESS(t1.email, 'abcdefghijklmnopqrstuvwxyz', 'k'), 1, 1))))
AS want
FROM WORK.HAVE t1;
QUIT;
First, COMPRESS with the 'k' modifier keeps only the lowercase letters;
Then the inner SUBSTRN takes the first of them, i.e. the first letter that appears in the email address;
After that, INDEXC returns the position of that letter in the email address;
Finally, the outer SUBSTRN returns the rest of the email, starting from the position found in the previous step.
final look:
[1]: https://i.stack.imgur.com/hFftg.png
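For comparison, the two techniques used in these answers (a regex anchored at the start, and skipping characters from a fixed digit set) can be sketched in Python; this is an illustration of the logic only, not a replacement for the SAS code, and the function names are hypothetical.

```python
import re

def strip_digits_regex(email):
    # Counterpart of prxchange('s/^\d+//', ...): remove digits at the start only.
    return re.sub(r"^\d+", "", email)

def strip_digits_scan(email):
    # Counterpart of the VERIFY approach: skip leading characters found in the digit set.
    return email.lstrip("0123456789")

for email in ["12345678aaa#mail.com", "12345678abc01#mail.com", "1234567890123456789012345"]:
    print(email, "->", repr(strip_digits_regex(email)), repr(strip_digits_scan(email)))
```

Both versions leave digits that appear later in the address alone, and both return an empty string when the value consists of digits only.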
In SAS, I need a PROC TABULATE where labels are repeated so that it's easier on Excel to find them using INDEX-MATCH. Here is an example with sashelp.cars.
The first PROC TABULATE has the advantage of having repeating labels, which is needed for the INDEX-MATCH. But, its flaw is that SAS only gives the non missing values.
data cars;
set sashelp.cars;
run;
proc sort data=cars;
by make;
run;
This doesn't give all labels. I would like a table with 3 continents by column (Europe, Asia, USA) and every car type (Sedan, SUV, Wagon, Sports...).
PROC TABULATE DATA = cars;
option missing=0;
by make;
CLASS make type Type Origin / mlf MISSING ;
TABLE (
(type*make)
), (Origin='') / printmiss nocellmerge ; RUN;
So, in order to have all 3 continents as columns, and every type of car (Sedan, SUV, Wagon, Sports...), I use CLASSDATA, as suggested:
Data level;
set cars;
keep make type Type Origin;
Run;
PROC TABULATE DATA = cars MISSING classdata=level;
option missing=0;
by make;
CLASS make type Type Origin / mlf MISSING ;
TABLE (
(make*type)
), (Origin='') / printmiss nocellmerge ;
RUN;
But this gives a humongous table with non-repeating labels. Is there a midway solution with:
all the columns (3 continents) like in the last table
only the concerned MAKEs, that is the first 6 rows for Acura
repeated labels like in the first PROC TABULATE
Thank you very much,
I advise against exporting the listing of proc tabulate to Excel
proc tabulate does not repeat values in the first column for each value in the second, because the output is meant for human reading. This is not the tool you need to write data to Excel for further lookup.
I advise using SUMIFS rather than MATCH
MATCH is a great function in excel, but is not a good choice for your application, because
it gives an error when it does not find what you look for, and that is why you need all labels in your output
it only supports one criterion, so you need at least 3 of them
it returns a position, so you still need an index function.
Therefore, I advise writing a simple create table
PROC sql;
create table TO_EXPORT as
select REGION, MACTIV, DATE, count(*) as cnt
from data
group by REGION, MACTIV, DATE;
proc export data = TO_EXPORT file="&myFolder\&myWorkbook..xlsx" replace;
RUN;
You will then have your data in Excel in a more data-oriented layout.
To retrieve the data, I advise the following type of Excel formula
=sumifs($D:$D,$A:$A,"13-*",$B:$B,$C:$C,"apr2020")
It sums the counts of all rows whose columns to the left match the criteria you are looking for.
Because at most one row will meet these criteria, it actually just looks up a count you are looking for.
If that count does not exist, it will just return zero.
Disclaimer:
I did not test this code, so if it does not work, leave a comment and I will.
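The reasoning behind the lookup (aggregate once, then look up a count that defaults to zero when the combination is absent) can be sketched in Python. The rows below are made-up sample data, and `lookup` is a hypothetical helper; only the column names follow the SQL above.

```python
from collections import Counter

# Hypothetical rows of (REGION, MACTIV, DATE); real data would come from the source table.
rows = [
    ("13-North", "A", "apr2020"),
    ("13-North", "A", "apr2020"),
    ("14-South", "B", "may2020"),
]

# Counterpart of GROUP BY REGION, MACTIV, DATE with count(*).
counts = Counter(rows)

def lookup(region, mactiv, date):
    # A Counter returns 0 for a key it has never seen, just as SUMIFS returns zero.
    return counts[(region, mactiv, date)]

print(lookup("13-North", "A", "apr2020"))
print(lookup("99-Nowhere", "A", "apr2020"))
```

This is why the exported table does not need a row for every possible combination: a missing combination simply looks up as zero.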
I have a table as follows:
ID Start End
AB 001 020
VG 004 098
I want to output a single column of ID series as follows:
ID2
AB001
AB002
AB003
...
AB020
VG001
...
VG097
VG098
I am trying to do this with Power Query in Excel as I cannot use R (the tool will be used by another person without access to R).
I am trying Table.InsertRows and Table.RepeatRows after transposing the table. But I am so far unable to use the Start/End values in my query (the number of IDs may vary) or even incrementing the values. I am quite a noob in this and to this day have worked with only minor manipulations of the GUI functions. Any detailed answer will be highly appreciated.
Thank you for your efforts in advance.
Try this - it generates a list from Start - End for each row, applies the ID prefix, then combines the output:
let
ListFunction = (Start, End, Prefix) =>
let
NewList = List.Transform(List.Numbers(Start, End - Start + 1), each Prefix & Number.ToText(_, "000"))
in
NewList,
Source = #table(type table [#"ID"=text, #"Start"=text, #"End"=text],{{"AB","001","020"},{"VG","004","098"}}),
#"Make Lists" = Table.AddColumn(Source, "NewList", each ListFunction(Number.From([Start]), Number.From([End]), [ID])),
#"Combine Lists" = Table.FromList(List.Combine(#"Make Lists"[NewList]), Splitter.SplitByNothing(),{"ID2"})
in
#"Combine Lists"
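To make the logic of the query easier to follow, here is the same sequence of steps (build a number list per row, add the prefix with three-digit padding, combine the lists) as a minimal Python sketch. It is only an illustration for readers who want to check the expected output; the function name `expand_ids` is hypothetical.

```python
def expand_ids(rows):
    """rows: iterable of (prefix, start, end) with start and end as text, e.g. ("AB", "001", "020")."""
    ids = []
    for prefix, start, end in rows:
        # Counterpart of List.Numbers(Start, End - Start + 1): every integer from start to end.
        for n in range(int(start), int(end) + 1):
            # Counterpart of Number.ToText(_, "000"): zero-pad to three digits.
            ids.append(f"{prefix}{n:03d}")
    return ids

ids = expand_ids([("AB", "001", "020"), ("VG", "004", "098")])
print(ids[0], ids[19], ids[20], ids[-1], len(ids))
```

As in the query, each series begins at that row's own Start value, so the VG series begins at VG004.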