I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
when you set this up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall this could help to avoid giving different numbers to unique values in your original variable that mean the same thing, while you would like them to have the same number.
Related
This is my first StackOverflow question, so apologies if I am unclear.
Currently, my work uses an Excel tracking doc to log project info. The column info is like so:
CELL B1 (Project Number) =IF(B2=""," ",MID(B2,FIND("P2",B2),9))
CELL B2 (Project Name) Client / P2XXXXXXX / Name
Thus, the P2XXXXXXX gets pulled out of B2 and populated into B1.
However, management has recently switched systems, so now, some project numbers have the P2XXXXXXX format and others have a PRJ-XXXXX format.
So we need a formula the produces nothing if the cell is blank and EITHER the P2XXXXXXX number or PRJ-XXXXX number if the cell is not blank.
Is it possible? If any further details are needed, let me know. Thanks in advance!
Well, if the / is always there then this can work:
IF(B2="","",MID(B2,FIND("/",B2,1)+2,9))
assuming the name is always 9 characters.
String Between Two Same Characters
Maybe the next month your company will start using a different first letter or could add more numbers e.g. SPRXXXXXXXXXX. So you could solve this problem by extracting whatever is between those two slashes.
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Find the first character =FIND("/",B2), but we need the next one:
=FIND("/",B2)+1
Find the second character but search from the postition after the first found:
=FIND("/",B2,FIND("/",B2)+1)
Now get the string between them:
=MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)
(note how the last minus was 'converted' from a plus to a minus (- + + = -)).
Remove the leading and trailing spaces:
=TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1))
Add the condition when the cell is blank:
=IF(B2="","",TRIM(MID(B2,FIND("/",B2)+1,FIND("/",B2,FIND("/",B2)+1)-FIND("/",B2)-1)))
Here's another way using LEFT and RIGHT:
=IF(B2="","",TRIM(LEFT(RIGHT(B2,LEN(B2)-FIND("/",B2)),FIND("/",B2))))
Although you can solve this problem with a combination of slicing, trimming, and complex conditionals, the most expressive and easy to maintain solution is to use regular expressions. Regular expressions have a bit of a learning curve, but there's a great playground website where you can experiment with them, and this page has a pretty good writeup on how regular expressions work in excel.
Specifically, this regular expression addresses the two naming conventions you've highlighted, but it can be updated to support more naming conventions as your company inevitably adds more:
P(RJ-)?((\d){9}|(\d){5})
To break that down from left to right:
P: both patterns start with a "P"
(RJ-)? One pattern follows with "RJ-", but the other doesn't. This is a grouped part of the pattern, and the question mark means that this part of the pattern is optional.
((\d){9}|(\d){5}): by far the nastiest part, but this basically means that there is going to be a sequence of numbers (\d), and there will either be nine of them or five of them. By wrapping the whole thing in parenthesis, they are always the second captured group, no matter the length of the sequence of numbers. This means that you can always extract the project id by looking at the value of the second capture group.
You can also make the expression more generalized by replacing ((\d){9}|(\d){5}) with simply (\d+). That just means "one or more digits." That gives you a much more simplified overall expression of this:
P(RJ-)?(\d+)
Depending on whether or not you care about validating strictly that project ids are 5 OR 9 digits long, that pattern above might be suitable, and it has the benefit of being more flexible. Still, the project ID is in the second captured group.
I've looked through the forums but couldn't find any questions (with answers) that helped. Any guidance would be appreciated.
I'm working on an Excel/Access project that cross references error codes. The codes are twelve digits long, with the first half and second half that need to be sortable. 99% of these codes are entirely numeric, but the 1% that includes letters is really screwing me up.
For example, a common error code might be "386748000123". This would be split into "386748" and "000123", with the first being the code for the type of system and the second being the type of error.
But then the 1% are something like this: "0957AB003A41". "0957AB", and "003A41".
If I format the columns (in Excel and Access) as numbers than the numeric comparisons are far easier, "000123" equals "123". If I format the column as strings than I can compare the alphanumeric values but then "000123" and "123" stop crossing.
The possible solution I've come across is utilizing the Val function inside an Access query to purely compare values but I've never used it and it seems like only a partial fix. Val ignores the strings, which means "0957AB" will have the same value as "0957XY" - and that doesn't work for this project.
I'm sure many of you have had similar issues, so I'm hoping to get some ideas on different ways the problem has been approached and resolved.
You have not provided a minimal sample of the data and also the output, also there is no code that I can amend it for you, but the only part that you are having problem is comparing the alphanumeric ones, you should format all of your data as strings and then compare. to make 123 be equal to "000123" you need to just format the numeric ones as string as below:
format(123,"000000")
which will give you "000123"
Edit
from you comment I learned that the problem is the key that is always or often a number, format will return the proper string for comparison, if it is already a 6-character string it will return itself so there would not be a problem:
do something like this:
if format(key,"0000000")=format(code,"000000") then
'do something
end if
I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within string values of a single variable, and then replace entire string value with a blank. I don't have experience with SQL, macros, etc. and I'm hoping for a way to do this (even if the code is less efficient" that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANSWD and TRANSLATE will not work as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little more clear on who the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
format Ref best1.
pathogen $40.;
input Ref pathogen $40. ;
datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
SET dataset1;
IF index(lowcase(pathogen),"growth") THEN pathogen="";
RUN;
I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata do not like the fact that the variable contains fraction numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .)
if substr(gdppercap, `x', 1) == ","
}
DATA STRUCTURE: I have a data set that can be read as an Excel or CSV file. It has the following variable types: dates, time, numeric variables, and what should be numeric variables that incorrectly have characters attached to a number - e.g. -0.011* and 0.023954029324) (the parentheses at the end is in the cell) - due to an error in the program that wrote the file. There are also blank lines between every record, and it is not realistic to delete all of these as I have hundreds of files to manage.
DATA ISSUE: We've determined that some values are correct up to the character (i.e. -0.011 is correct as long as the asterisk is removed), while other values, such as 0.023954029324), are incorrect altogether and should be made missing. Please don't comment on this issue as it is out of my control and at this point all I can do is manage the data until the error is fixed and character values stop being written into the files.
PROBLEM WITH SAS:
1) If I use PROC IMPORT with an Excel file, SAS uses the first eight lines (20 for a CSV file) to determine if a variable is numeric or character. If the asterisk of parenthesis doesn't occur within the first 20 lines, SAS says the variable is numeric, then makes any later cells w/ character values missing. This is NOT okay in the case of the asterisks, because I want to maintain the numeric portion of the value and remove the asterisk in a later data step. Importing Excel files with PROC IMPORT does not allow the GUESSINGROWS option (as it does w/ CSV files, see below). Edit: Also, the MIXED=YES option does NOT work (see comments below - still need to change number of rows SAS uses, which, to me, means this option does...what?).
2) If I use PROC IMPORT with a CSV file, I can specify GUESSINGROWS=32767 and I get really excited because it then determines the variables with the asterisks are character and maintains the asterisks. However, it very strangely no longer determines the variables with parentheses as character (as it would have done when importing an Excel file as long as the parenthesis was in the first 20 lines), but instead removes the character and additionally rounds the value to the nearest whole number (0.1435980234 becomes 0, 1.82149023843 becomes 2, etc.). This is way too coarse of rounding - I need to maintain the decimal places. And, on top of that, the parentheses are now gone so I can't make the appropriate cells missing. I do not know if there is a way to make SAS not round and/or maintain the parentheses. To me, this is inconsistent behavior - why is the asterisk but not a parenthesis considered a character in this case? Also, when I read in the Excel file w/ PROC IMPORT (as described in (1)), it can cope w/ the parentheses (if they appear in the first 20 lines) - another inconsistency.
3) If I use INFILE, well - I get an error w/ every variable I try to read in - this procedure is way too sensitive and unstable for how varying the data are (and I have to code a work-around for the blank data lines).
ULTIMATE GOAL (note this code will be run automatically within a macro, if that matters):
1) Read date variable as a date
2) Read time variable as time
3) Be able to identify a variable w/ characters present in any cell of that variable (even after 20 lines) as a character variable and maintain the values in the cells (i.e. don't round/delete character). This can be by a priori telling SAS to let a certain set of variables be character (I will change them to numeric after I get rid of characters/make cells missing), or by SAS identifying variables w/ characters on its own.
SAS actually by default uses the first 8 rows. That is defined in a registry setting, TYPEGUESSROWS - which is normally stored in HKLM\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel\TypeGuessRows\ (or insert-your-office-version-there). Change that value to FFFF (hex)/65536 (decimal) or some other large number, or zero to search the maximum number of rows (a bit over 16000 - exact number is hard to find).
For a CSV file, you can write a data step import to control the formats of each variable. The easiest way to see this is to run the PROC IMPORT, then check your log; the log will contain the complete code used to read in the file in a data step. Then just modify the informats as needed. You say you have too much trouble with Infile method, so perhaps this won't work for you, but typically you can work around any inconsistencies - and if your files are THAT inconsistent it sounds like you'll be doing a ton of manual work anyway. This gives you the options to read in date/time variables correctly as well.
You also can use PROC IMPORT/CSV to the log, writing the log out to a file, then read THAT in and generate new import code on your own - or even off a proc contents of the generated file, making known modifications.
Not sure what you're asking about with date/time as you don't mention issues with it in the first part of your question.
One additional option you have is to clean out the characters before it's read in (from the CSV). This is pretty simple, if it's truly just numerics and commas (and decimals and negative signs):
data mydata;
infile myfile /*options*/;
input ##;
length infileline $32767; *or your longest reasonable line;
infileline = compress(_infile_,'.-','kd');
run;
data _null_;
set mydata;
file myfile /*options*/ /*or a new file if you prefer */;
put #1 infileline $32767.; *or your longest reasonable line;
run;
Then read that new file in using proc import. I am splitting it into two datasteps so you can see it, but you could combine them into one for ease of running - look up "updating a file in place" in SAS documentation. You could also accomplish this cleaning using OS specific tools; on Unix for example a short awk script could easily remove the misbehaving characters.