Creating a variable if key term appears in text - excel

I am trying to create a variable "thromboembolism death", 0 if it is not the cause of death, 1 if it is.
Is there any way to go through this data set in SPSS or Excel and create a new variable if one of the key terms (e.g. DVT, pulmonary embolism, thromboembolism) appears in the line of text? Here is what my data looks like right now.
https://i.stack.imgur.com/WDrBs.png
Also, the data set is very large: 250,000+ cases. I am new to data analysis, thanks for the help!

In SPSS, assuming you have a variable named death_cause with description verbatims:
COMPUTE thromboembolism_death = (INDEX(UPCASE(death_cause),'DVT') > 0)
OR (INDEX(UPCASE(death_cause),'PULMONARY EMBOLISM') > 0)
OR (INDEX(UPCASE(death_cause),'THROMBOEMBOLISM') > 0).
EXE .
In Excel, you could take a similar approach. Assuming your text verbatims are in column A:
=IF(OR(ISNUMBER(SEARCH("DVT",A1)),ISNUMBER(SEARCH("PULMONARY EMBOLISM",A1)),ISNUMBER(SEARCH("THROMBOEMBOLISM",A1))),1,0)
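Alternatively, if you're comfortable using SUMPRODUCT(), the formula gets a bit shorter. Assuming you list your "strings to search for" in cells C2:C4 (leave no blank cells in the range, because SEARCH treats an empty search term as found in every text), the following returns 1 or 0 and can be filled down:
=--(SUMPRODUCT(--ISNUMBER(SEARCH($C$2:$C$4,A1)))>0)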
Note that all of the above options are case-insensitive.
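For readers who would rather script this than use a formula, the same case-insensitive substring check can be sketched in Python. This is only an illustration under assumptions: the function name `thromboembolism_death` and the `KEY_TERMS` tuple are hypothetical, and the example strings are made up rather than taken from the asker's data.

```python
# Hypothetical key terms, stored in lower case; extend the tuple as needed.
KEY_TERMS = ("dvt", "pulmonary embolism", "thromboembolism")


def thromboembolism_death(cause_of_death: str) -> int:
    """Return 1 if any key term occurs in the text (case-insensitive), else 0."""
    text = cause_of_death.lower()
    return int(any(term in text for term in KEY_TERMS))


# Made-up example rows, not the asker's data.
examples = ["Pulmonary Embolism following surgery", "Myocardial infarction", "dvt"]
flags = [thromboembolism_death(e) for e in examples]
print(flags)  # [1, 0, 1]
```

Like the SPSS and Excel versions, this is a plain substring match, so a short term such as "DVT" would also match if it happened to appear inside a longer word.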

Related

Concatenate the shortcut st, nd, rd to column value - Excel

Column A has numbers from 1 to 5, and in column B I want to concatenate the number in column A with the relevant ordinal suffix (st, nd, rd, th), as indicated in the image below. Any help will be greatly appreciated!
Without using VBA, your best option would be the CHOOSE() function.
Try something like this for any whole number; it returns the number with its suffix appended:
=A1&IF(AND(MOD(ABS(A1),100)>10,MOD(ABS(A1),100)<14),"th",CHOOSE(MOD(ABS(A1),10)+1,"th","st","nd","rd","th","th","th","th","th","th"))
You can set up a named range, "key", separately, much like the table you are showing, and then reference the key to replace any number with the desired output.
You can then use INDEX/MATCH or VLOOKUP on the number, referencing the table, to find the output.
For example:
=VLOOKUP($A1,key,2,FALSE)
You could also use nested IF functions with RIGHT, like this:
=IF(OR(RIGHT(H2,2)="11",RIGHT(H2,2)="12",RIGHT(H2,2)="13"),CONCAT(H2,"th"),IF(RIGHT(H2,1)="1",CONCAT(H2,"st"),IF(RIGHT(H2,1)="2",CONCAT(H2,"nd"),IF(RIGHT(H2,1)="3",CONCAT(H2,"rd"),CONCAT(H2,"th")))))
This is probably not the fastest option performance-wise.
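The rule all of these formulas encode (11, 12 and 13 always take "th"; otherwise the last digit decides) may be easier to see outside a one-line formula. Here is a minimal Python sketch of that rule; the function name `ordinal` is an assumption made for illustration.

```python
def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix appended, e.g. 1 -> '1st'."""
    n_abs = abs(n)
    if 11 <= n_abs % 100 <= 13:
        # 11th, 12th, 13th (and 111th, 212th, ...) are the exceptions.
        suffix = "th"
    else:
        # The last digit decides; anything other than 1, 2, 3 takes "th".
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n_abs % 10, "th")
    return f"{n}{suffix}"


labels = [ordinal(n) for n in (1, 2, 3, 4, 11, 12, 13, 21, 112)]
print(labels)
# ['1st', '2nd', '3rd', '4th', '11th', '12th', '13th', '21st', '112th']
```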

Separating values that are combined in one string

I would like to solve this either in Excel or in SPSS:
I have categorical data (each number representing a medical diagnosis) that are combined into single cells. In other words, a row (patient) has multiple diagnoses. However, I would like to know the frequencies of each diagnosis. What is the best way to go about this? (See picture for reference)
For SPSS:
First just creating some sample data to demonstrate on:
data list free/e_cerv_dis_state (a20).
begin data
"{1/2/3/6}" "{1/2/4}" "{2/4/5}" "{1/5/6}" "{4}" "{4/5/6}" "{1/2/3/4/5/6}"
end data.
Now the following code will create a separate variable for each possible diagnosis, and will put a 1 in it if the diagnosis exists in the original variable.
do repeat vr=diag1 to diag9/vl=1 to 9.
compute vr=char.index(e_cerv_dis_state, string(vl, f1) ) > 0.
end repeat.
freq diag1 to diag6.
Note this will only work for up to 9 diagnoses. If you have more than that the solution will have to be adapted to multiple digits.
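If more than 9 diagnosis codes are possible, one way around the single-digit limitation is to split each cell on the "/" separator instead of searching for a digit. The following Python sketch shows the idea on the same sample values; the function name `diagnosis_counts` is hypothetical, and it assumes every cell follows the "{a/b/c}" pattern shown above.

```python
from collections import Counter


def diagnosis_counts(cells):
    """Count how many cells (patients) contain each diagnosis code."""
    counts = Counter()
    for cell in cells:
        # "{1/2/3}" -> ["1", "2", "3"]; splitting keeps multi-digit codes whole.
        codes = cell.strip().strip("{}").split("/")
        # A set counts each code at most once per patient, like the 0/1 flags.
        counts.update({code for code in codes if code})
    return counts


sample = ["{1/2/3/6}", "{1/2/4}", "{2/4/5}", "{1/5/6}", "{4}", "{4/5/6}", "{1/2/3/4/5/6}"]
counts = diagnosis_counts(sample)
print(sorted(counts.items()))
# [('1', 4), ('2', 4), ('3', 2), ('4', 5), ('5', 4), ('6', 4)]
```

Because the codes are compared as whole tokens, a code such as "10" is not mistaken for "1", which is the failure the single-digit search would run into.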
Assuming that the number of columns is fairly regular, I would suggest using Text to Columns and then using COUNTIF on the resulting cells to count the value you want. However, there is a more robust and reproducible solution that involves SQL. You can download the free SQL Server Express edition here: https://www.microsoft.com/en-gb/sql-server/sql-server-downloads
Then you can import your table of data; for how to do that, see: How to import an Excel file into SQL Server?
Then you could use the more friendly SQL database to get the answers you want. For example you can use a select statement that would say:
SELECT count(e_cerv_dis_state)
FROM diagnoses -- hypothetical name; use whatever you called the imported table
WHERE e_cerv_dis_state = '6'
It would also be possible to use a CASE WHEN statement to add-in the names of the diagnoses.

SAS: Match single word within string values of a single variable then replace entire string value with a blank

I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANSWD and TRANSLATE will not work as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little more clear on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
format Ref best1.
pathogen $40.;
input Ref pathogen $40. ;
datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
SET dataset1;
IF index(lowcase(pathogen),"growth") THEN pathogen="";
RUN;
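The logic of the DATA step above (lower-case the value, look for the word, blank the whole value on a match) is not specific to SAS. As a rough cross-check, here is the same idea as a Python sketch; the function name `blank_if_contains` and the list-of-dicts layout are assumptions for illustration only. It touches only the pathogen field and leaves every other field alone.

```python
def blank_if_contains(value: str, word: str = "growth") -> str:
    """Return '' if word occurs anywhere in value (case-insensitive), else value."""
    return "" if word.lower() in value.lower() else value


# Same sample rows as the SAS datalines above.
rows = [
    {"Ref": 1, "pathogen": "No growth during two days"},
    {"Ref": 2, "pathogen": "no growth,"},
    {"Ref": 3, "pathogen": "growth did not occur,"},
    {"Ref": 4, "pathogen": "does not have the word"},
]

# Only the pathogen field is rewritten; Ref is copied through unchanged.
cleaned = [{**row, "pathogen": blank_if_contains(row["pathogen"])} for row in rows]
print([row["pathogen"] for row in cleaned])
# ['', '', '', 'does not have the word']
```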

Prevent comma-separated list of numbers being interpreted as single large value

The original value was 33266500,332665100,332665200,332665300, and the cell should display exactly that, but what I see as the cell value in Excel is 3.32665E+34.
So I want to convert it back into the original string. I found the Format function on Google and used it like this:
format(3.32665E+34,"standard")
which gives 332,6650,033,266,510,000,000,000
How can I parse it or get back the original string? I believe Format is a VBA function.
Excel has a 15 digit precision limit. If the numbers are already shown like this when you access the file, there is no way to get the number back - you have already lost some digits. VBA code and formulas will not help you.
If this is not the case, you can add a single quote ' mark before the number to store it as text. This will ensure Excel does not try to treat it as a number and thus lose precision.
If you want the value kept exactly, store the data as a string, not as a number. The data type you are using simply doesn't have the ability to do what you are asking it to do.
If you're starting with an Excel file that has already been created then you've already lost the information: Excel has tried to understand what it was given and its best guess has turned out to be wrong. All you can do (if you can't get the source data) is go back to the creator of the Excel file and tell them what's wrong.
If you're starting with, say, a text file that you're importing, then the news is much better:
If you're importing manually using the Text Import Wizard, then at "Step 3 of 3" you need to set "Column Data Format" for the problem field to "Text".
If you're using a macro, you'll need to specify a value for the TextFileColumnDataTypes property that does the same thing. The easiest way to get it right is to use the Macro Recorder.
If you want the four values in the string to be separate cells, then again, look at the Text Import Wizard settings: in Step 1 of 3 you need to set "Delimited" data type (usually the default) and in Step 2 make sure that "Comma" is checked.
The value needs to be entered into the cell as a string. You need to make whatever inserts the value precede it with a single quote (').
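Why the digits cannot be recovered can be demonstrated outside Excel. Python floats are IEEE 754 doubles, the same kind of floating-point number Excel uses for numeric cells (Excel additionally limits what it keeps to 15 significant digits), so the following sketch shows the same kind of loss. The variable names are illustrative only.

```python
raw = "33266500,332665100,332665200,332665300"

# Reading the commas as thousands separators yields one 35-digit number.
digits = raw.replace(",", "")
as_number = float(digits)

print(f"{as_number:.5E}")  # 3.32665E+34

# Converting back to text does not reproduce the original digits:
# a double cannot hold 35 significant digits.
round_trip = f"{as_number:.0f}"
print(round_trip == digits)  # False

# Kept as text, the value is intact and can be split into its four parts.
parts = raw.split(",")
print(parts)  # ['33266500', '332665100', '332665200', '332665300']
```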

Converting a number (non-currency) value to a text entry

Easy enough concept, but I have no idea where to start when it comes to creating a UDF, which is the only thing I can find any mention of. I have a column that populates on source sheets with either a 1 or 2. I want to do something so that all of the "1's" shows as one text entry("AA" for example) and all "2's" show as a different entry(say "BB"). Is this possible without a UDF; and if not then is there any advice on where to start?
You can use custom formatting for this. Right-click the column in question and choose "Format Cells." In the dialog, choose "Custom" and in the box at the top enter:
[=1]"AA";[=2]"BB";General
This assumes that the "1" or "2" is the sole content of the cell. Any other number or text will display in the General format.
This may help you as well. It is a conditional formula that references one cell and checks whether it has any content: if the cell is empty it returns the word "None", otherwise it returns the contents of the cell.
=IF((Sheet1!J1089)="","None",Sheet1!J1089)
Just to update anyone else that may be interested. I have a solution that I am using. Had to go the vba route, but I've got it set up so that my macro for running reports runs the following:
Sub Conversion()
Dim X As Long, DBCodes() As String
DBCodes = Split("AA,BB,CC", ",")
For X = 1 To 3
Columns("H").Replace X, DBCodes(X - 1), xlWhole
Next
End Sub
I can change the Split values and the loop's upper bound in the following line to replace as many more values as I need, though it would take some experimenting to find the point where too many values make it impractical. Where I put the call to this macro also makes a big difference; I found the best spot, and even on reports of 600+ rows the conversion only adds a couple of seconds.
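The macro is essentially a lookup table applied to a column: each whole-cell value that matches a code is swapped for its label, and everything else is left alone. A minimal Python sketch of that idea follows; the names `CODE_LABELS` and `label_for` and the sample column are hypothetical.

```python
# Hypothetical mapping from numeric codes to text labels (1 -> "AA", and so on).
CODE_LABELS = {1: "AA", 2: "BB", 3: "CC"}


def label_for(value):
    """Return the label for a known code; leave any other value unchanged."""
    return CODE_LABELS.get(value, value)


# Made-up column contents, including values with no mapping.
column_h = [1, 2, 3, 2, 7, "text"]
converted = [label_for(v) for v in column_h]
print(converted)  # ['AA', 'BB', 'CC', 'BB', 7, 'text']
```

Adding more codes only means adding entries to the dictionary, so no loop bound has to be kept in sync with the list of labels.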
