SAS: Match single word within string values of a single variable then replace entire string value with a blank - string

I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANSWD and TRANSLATE will not work as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim

Could you be a little clearer on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
format Ref best1.
pathogen $40.;
input Ref pathogen $40. ;
datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
SET dataset1;
IF index(lowcase(pathogen),"growth") THEN pathogen="";
RUN;
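For readers who want to check the logic outside SAS, the same rule can be sketched in Python. This is only an illustration of the lower-case-then-search idea used in the data step above; the function name blank_if_contains and the sample list are assumptions made for the example.

```python
def blank_if_contains(value, word="growth"):
    """Return an empty string if `word` occurs anywhere in `value`, ignoring case."""
    return "" if word.lower() in value.lower() else value


# Sample values taken from the datalines above.
pathogens = [
    "No growth during two days",
    "no growth,",
    "growth did not occur,",
    "does not have the word",
]

cleaned = [blank_if_contains(p) for p in pathogens]
print(cleaned)
```

Like INDEX in the SAS step, this is a plain substring test, so a value such as "overgrowth" would also be blanked.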

Related

extract certain text after certain characters

What is the easiest way to extract certain details from a cell with an Excel formula? For example, if cell A1 contains column=""HMI_LOCATE"" px=""CLASS"" position=""99"" validation=""ROOM"", I'm trying to extract just the data that falls between the double "" after px=. In this example I need to extract just the letters CLASS and nothing else. The part I'm trying to extract won't always be 5 characters long; it could be much longer or shorter.
Do you want to achieve this?
With o365 you can use this formula
=FILTERXML("<t><s>"&SUBSTITUTE(A1,CHAR(34)&CHAR(34),"</s><s>")&"</s></t>","//s[position() mod 2 = 0]")
or for older EXCEL-versions
=IFERROR(INDEX(FILTERXML("<t><s>"&SUBSTITUTE($A$1,CHAR(34)&CHAR(34),"</s><s>")&"</s></t>","//s"),ROW(A1)*2),"-")
This splits the string at each pair of quotation marks (CHAR(34)&CHAR(34)) and builds an array of elements. Then every second element is returned.
For tons of other possibilities have a look at this awesome guide by JvdV.
EDIT:
To get the element after px= no matter where it is, you can use
=LET(list,
FILTERXML("<t><s>"&SUBSTITUTE($A$1,CHAR(34)&CHAR(34),"</s><s>")&"</s></t>","//s"),
INDEX(list,MATCH("px=",list,0)+1)
)
The LET function lets you assign names to intermediate results, which can then be used in further calculations.
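The "find px= and take what follows" idea can also be sketched outside Excel. The Python below is a minimal illustration, not a translation of the FILTERXML formula: the function name value_after_key and the regular expression are assumptions made for the example.

```python
import re


def value_after_key(text, key="px"):
    """Return the text between the doubled quotes that follow `key=`, or None."""
    # The lookbehind stops a key such as "px" from matching inside a longer name.
    pattern = r'(?<![A-Za-z0-9_])' + re.escape(key) + r'=""(.*?)""'
    match = re.search(pattern, text)
    return match.group(1) if match else None


cell = 'column=""HMI_LOCATE"" px=""CLASS"" position=""99"" validation=""ROOM""'
print(value_after_key(cell))
```

Because the match is non-greedy, the extracted value can be any length, which covers the case where the part after px= is longer or shorter than five characters.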

Comparing values with some alphanumerics

I've looked through the forums but couldn't find any questions (with answers) that helped. Any guidance would be appreciated.
I'm working on an Excel/Access project that cross references error codes. The codes are twelve digits long, with the first half and second half that need to be sortable. 99% of these codes are entirely numeric, but the 1% that includes letters is really screwing me up.
For example, a common error code might be "386748000123". This would be split into "386748" and "000123", with the first being the code for the type of system and the second being the type of error.
But then the 1% are something like this: "0957AB003A41". "0957AB", and "003A41".
If I format the columns (in Excel and Access) as numbers, then the numeric comparisons are far easier: "000123" equals "123". If I format the columns as strings, then I can compare the alphanumeric values, but "000123" and "123" stop matching.
The possible solution I've come across is utilizing the Val function inside an Access query to purely compare values but I've never used it and it seems like only a partial fix. Val ignores the strings, which means "0957AB" will have the same value as "0957XY" - and that doesn't work for this project.
I'm sure many of you have had similar issues, so I'm hoping to get some ideas on different ways the problem has been approached and resolved.
You have not provided a minimal sample of the data or the expected output, and there is no code for me to amend. Since the only part giving you trouble is comparing the alphanumeric codes, format all of your data as strings and then compare. To make 123 equal to "000123", format the numeric values as strings, as below:
format(123,"000000")
which will give you "000123"
Edit
From your comment I learned that the problem is the key, which is always or often a number. Format will return the proper string for comparison; if the value is already a 6-character string it will be returned unchanged, so there would not be a problem.
Do something like this:
if format(key,"000000")=format(code,"000000") then
'do something
end if
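The zero-padding idea can be checked quickly in Python. This is a sketch under assumptions: normalize_half and split_code are hypothetical helper names, and the six-character width comes from the question's description of the two code halves.

```python
def normalize_half(code, width=6):
    """Left-pad a code half with zeros so numeric and text forms compare equal."""
    return str(code).strip().zfill(width)


def split_code(full_code, width=6):
    """Split a twelve-character code into its system half and its error half."""
    text = str(full_code).strip().zfill(2 * width)
    return text[:width], text[width:]


print(normalize_half(123) == normalize_half("000123"))        # numeric vs. padded string
print(normalize_half("0957AB") == normalize_half("0957XY"))   # letters are preserved
print(split_code("386748000123"))
```

Unlike Val, this keeps the letters in the comparison, so "0957AB" and "0957XY" remain distinct.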

excel vba Delete entire row if cell contains the GREP search

I have a single column of text in Excel that is to be used for translating into foreign languages. The text is automatically generated from an InDesign file. I would like to clean it up for the translator by removing rows that simply contain a number ("20", "34.5", etc.) or a measurement ("5mm", "3.5 µm", etc.). I've found many posts (see link below) on how to remove a row with a specific string, but none that use search patterns such as those I typically use in GREP searches: "\d+" and "\d.\d µm".
How would I do this? I am on macOS, if that helps.
Note that I would need to delete the row if the cell only contains a number or a measurement, not if the number is contained within a phrase, sentence, or paragraph, etc.
https://stackoverflow.com/a/30569969
It may not be what you are looking for, but how about just sorting the column and removing the rows that start with numbers? It is a manual approach, but from what I understand this translation process only happens from time to time. Am I right?
I see two possible issues in your question:
How to work with regular expressions in Excel?
How to delete rows in a loop?
Let me start with the second question: when you want to create a for-loop in order to remove items from a list, you MUST start at the end and go back to the beginning (it's a beginner's trick, but a lot of people trip over it).
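Both points can be illustrated together in a short Python sketch. It is not VBA and not the asker's final pattern: the regular expression (a number with an optional decimal part and an optional unit of up to three letters) and the function name remove_numeric_rows are assumptions for the example.

```python
import re

# Assumed pattern: a number, optionally with decimals, optionally followed by a short unit.
NUMBER_OR_MEASUREMENT = re.compile(r"\d+(?:\.\d+)?(?:\s*[A-Za-zµ]{1,3})?")


def remove_numeric_rows(rows):
    """Delete, in place, every row that consists only of a number or a measurement."""
    # Walk from the last index to the first so deletions never shift rows
    # that have not been checked yet.
    for i in range(len(rows) - 1, -1, -1):
        if NUMBER_OR_MEASUREMENT.fullmatch(rows[i].strip()):
            del rows[i]
    return rows


rows = ["Title", "20", "34.5", "5mm", "3.5 µm", "Use 5mm screws", "Intro 2"]
print(remove_numeric_rows(rows))
```

Using fullmatch rather than a search is what keeps rows such as "Use 5mm screws", where the number sits inside a phrase. Looping forward instead would skip the row immediately after each deleted one, because the remaining rows move up by one position.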
About the first question: this is a very useful post about this subject, it's too large to even give a summary here.

SPSS converting a string into a numeric variable issue

I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
Once you have set this up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTRIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall this could help to avoid giving different numbers to unique values in your original variable that mean the same thing, while you would like them to have the same number.
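The clean-then-look-up idea can be sketched in Python as follows. This is an illustration only: clean_label mirrors the upper-case, trim and remove-underscore steps described above, and the numeric code 1 in the lookup table is a made-up value.

```python
def clean_label(label):
    """Upper-case the label, trim surrounding blanks and drop underscores."""
    return label.strip().upper().replace("_", "")


# Lookup table keyed by the cleaned-up label; the code 1 is hypothetical.
code_by_label = {clean_label("_LSC Debt licensed work"): 1}

raw_values = [
    "_LSC Debt licensed work",
    "LSC Debt licensed work ",
    "_LSC Debt Licensed Work",
    "Something not in the list",
]

# Values missing from the lookup table come back as None instead of a guessed code.
codes = [code_by_label.get(clean_label(value)) for value in raw_values]
print(codes)
```

The three spellings of the same label collapse to one key, so they all receive the same code, while an unlisted value is left unassigned for manual review.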

Looking up Bigrams in Excel

Suppose I have a list of two-word pairs in a column in Excel. These words are delimited by a space so that a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that the bigram will only be found if the two words are together and separated by a space. What would be preferable is if Excel could look for both words anywhere in a given larger string. It is crucial that the bigrams occupy one cell each since a score is assigned to each bigram and in fact the function used VLOOKUPs this value based on the bigram cell value. Would it make sense to change the space between any two words to a - or some other character? Is there a way to have Excel look up each value one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?
Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."
Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.
If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.
This is pretty easy with a couple of formulas.
The logic is simple. Assuming your bigram is in B2, we can input the following in C2. This will replace the spaces with *, which is Excel's wildcard character.
=SUBSTITUTE(B2," ","*")
Then we concatenate it to give us a wildcarded beginning and end.
=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")
We then use a simple COUNTIF against the statement (here in A2) to return a count of occurrences.
=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.
=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
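For comparison, here is a minimal Python sketch of the same check. It is not an Excel solution, and the scores dictionary holds made-up values. One difference worth noting: a wildcard pattern of the form *extreme*happiness* only matches when the first word appears before the second, whereas the sketch below accepts the words in either order.

```python
def contains_all_words(text, bigram):
    """True if every word of `bigram` occurs somewhere in `text`, ignoring case."""
    lowered = text.lower()
    return all(word in lowered for word in bigram.lower().split())


sentence = ("The weather last night was extremely cold, "
            "but the warm fire gave me some happiness.")

# Hypothetical bigram scores, standing in for the VLOOKUP table.
scores = {"extreme happiness": 2.0, "weather gave": 1.5, "sunny day": 3.0}

total = sum(score for bigram, score in scores.items()
            if contains_all_words(sentence, bigram))
print(total)
```

As in the question's example, "extreme" is found inside "extremely" because the test is a substring match rather than a whole-word match.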
Let us know if this helps.
