DATA STRUCTURE: I have a data set that can be read as an Excel or CSV file. It has the following variable types: dates, times, numeric variables, and what should be numeric variables that incorrectly have characters attached to a number - e.g. -0.011* and 0.023954029324) (the closing parenthesis is part of the cell value) - due to an error in the program that wrote the file. There are also blank lines between every record, and it is not realistic to delete all of these as I have hundreds of files to manage.
DATA ISSUE: We've determined that some values are correct up to the character (i.e. -0.011 is correct as long as the asterisk is removed), while other values, such as 0.023954029324), are incorrect altogether and should be made missing. Please don't comment on this issue as it is out of my control and at this point all I can do is manage the data until the error is fixed and character values stop being written into the files.
PROBLEM WITH SAS:
1) If I use PROC IMPORT with an Excel file, SAS uses the first eight lines (20 for a CSV file) to determine if a variable is numeric or character. If the asterisk or parenthesis doesn't occur within those first rows, SAS decides the variable is numeric and makes any later cells w/ character values missing. This is NOT okay in the case of the asterisks, because I want to keep the numeric portion of the value and remove the asterisk in a later data step. Importing Excel files with PROC IMPORT does not allow the GUESSINGROWS option (as it does w/ CSV files, see below). Edit: Also, the MIXED=YES option does NOT work (see comments below - I still need to change the number of rows SAS scans, which, to me, means this option does...what?).
2) If I use PROC IMPORT with a CSV file, I can specify GUESSINGROWS=32767 and I get really excited because it then determines the variables with the asterisks are character and maintains the asterisks. However, it very strangely no longer determines the variables with parentheses as character (as it would have done when importing an Excel file, as long as the parenthesis was within the scanned rows), but instead removes the character and additionally rounds the value to the nearest whole number (0.1435980234 becomes 0, 1.82149023843 becomes 2, etc.). This rounding is far too coarse - I need to maintain the decimal places. And, on top of that, the parentheses are now gone, so I can't make the appropriate cells missing. I do not know if there is a way to make SAS not round and/or keep the parentheses. To me, this is inconsistent behavior - why is the asterisk, but not a parenthesis, considered a character in this case? Also, when I read in the Excel file w/ PROC IMPORT (as described in (1)), it can cope w/ the parentheses (if they appear within the scanned rows) - another inconsistency.
3) If I use INFILE in a data step, well - I get an error w/ every variable I try to read in; this approach is way too sensitive to how much the data vary (and I have to code a work-around for the blank data lines).
ULTIMATE GOAL (note this code will be run automatically within a macro, if that matters):
1) Read date variable as a date
2) Read time variable as time
3) Be able to identify a variable w/ characters present in any cell of that variable (even after 20 lines) as a character variable and maintain the values in the cells (i.e. don't round/delete the character). This can be by a priori telling SAS to let a certain set of variables be character (I will change them to numeric after I get rid of characters/make cells missing), or by SAS identifying variables w/ characters on its own (a sketch of that later cleanup step follows this list).
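For reference, the later cleanup step I have in mind would look something like this (variable names are placeholders; value1_c stands for the raw character version of one of the problem variables):

data clean;
    set raw;
    /* values ending in ')' are wrong altogether and become missing; */
    /* values with '*' are correct once the asterisk is stripped off */
    if index(value1_c, ')') then value1 = .;
    else value1 = input(compress(value1_c, '*'), ?? best32.);
run;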
SAS actually by default uses the first 8 rows. That is defined by a registry setting, TypeGuessRows, which is normally a value under HKLM\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel (insert your Office version there). Change that value to FFFF (hex) / 65535 (decimal) or some other large number, or to zero to scan the maximum number of rows (a bit over 16000 - the exact number is hard to find).
For a CSV file, you can write a data step import to control the formats of each variable. The easiest way to see this is to run the PROC IMPORT, then check your log; the log will contain the complete code used to read in the file in a data step. Then just modify the informats as needed (a sketch follows). You say you have too much trouble with the INFILE method, so perhaps this won't work for you, but typically you can work around any inconsistencies - and if your files are THAT inconsistent, it sounds like you'll be doing a ton of manual work anyway. This also lets you read the date/time variables correctly.
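For example, a minimal sketch of such a data step - file name, variable names, and informats here are all hypothetical; take the real ones from the code PROC IMPORT writes to your log:

data want;
    infile "have.csv" dsd truncover firstobs=2;
    informat rec_date yymmdd10. rec_time time8. value1 $32. value2 $32.;
    format rec_date yymmdd10. rec_time time8.;
    /* the problem columns are read as character so asterisks and parentheses survive */
    input rec_date rec_time value1 $ value2 $;
    if missing(rec_date) and missing(value1) then delete; /* skip the blank lines */
run;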
You can also run PROC IMPORT on the CSV, write the log out to a file, then read THAT in and generate new import code of your own - or even work from a PROC CONTENTS of the generated dataset, making known modifications.
Not sure what you're asking about with date/time as you don't mention issues with it in the first part of your question.
One additional option you have is to clean out the characters before it's read in (from the CSV). This is pretty simple, if it's truly just numerics and commas (and decimals and negative signs):
data mydata;
  infile myfile /*options*/;
  input;                      /* loads the raw record into _infile_ */
  length infileline $32767;   /* or your longest reasonable line */
  infileline = compress(_infile_, '.,-', 'kd'); /* keep digits plus periods, commas (the delimiters), and minus signs */
run;
data _null_;
  set mydata;
  file myfile /*options*/ /* or a new file if you prefer */;
  put infileline;           /* list output trims the trailing blanks */
run;
Then read that new file in using PROC IMPORT. I am splitting it into two data steps so you can see it, but you could combine them into one for ease of running - look up "updating a file in place" in the SAS documentation. You could also accomplish this cleaning using OS-specific tools; on Unix, for example, a short awk script could easily remove the misbehaving characters (see the sketch below).
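For instance, a hypothetical one-liner of that sort, assuming the only misbehaving characters are asterisks and closing parentheses (file names are placeholders):

awk '{ gsub(/[*)]/, ""); print }' dirty.csv > clean.csv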
I can find several topics and solutions on importing from excel to SAS and how to deal with variable names containing blanks or spaces.
However, in my situation, some of the variable values contain spaces at the end, and after importing I can see the trailing blanks, but compress does not remove them.
I'm thinking they're some other type of character. I've tried some modifiers on the compress function, but cannot seem to make it recognize these spaces.
Because I'm often creating different excel files, I would prefer not having to remove the blanks manually. Is there an option to the proc import step I should add, or is there a modifier I can provide to the compress function to solve this?
I'm using the following basic code to import:
proc import out = METADATA
datafile = "&mdata\meta_data.xlsx"
DBMS = Excel replace;
SHEET = "Sheet1";
GETNAMES = YES;
run;
EDIT (after implementing instructions from comments):
I don't really know what the part of SAS I'm using is called - I started working with SAS recently.
I'm using some kind of editor with a VIEWTABLE window. When looking at my dataset this way, I can select (as in highlight) the variable values. One of my values has a trailing whitespace - I can highlight a finite space beyond the end of the string, which I can't for the other variables. And I know the space is there because I put it there in Excel as well.
The length of my variable is 8, and setting the format to $HEX128 shows:
DOSE 444F534520202020
DOSE2 444F534532A02020
DOSE2 contains the blank space so it's actually 'DOSE2 ' in excel and in the VIEWTABLE.
When converting from string to hex, the character '2' becomes 32, which means the whitespace was converted to 'A0' instead of '20'.
Just as a reference for other people searching on these keywords or this topic:
After importing from excel where your values contain spaces, you might end up with a special kind of whitespace: these are non-breaking spaces.
You can find out by setting the format to $HEX128.: the offending whitespaces show up as A0 instead of the 20 used for regular spaces.
If you want to remove these, you can use var = compress(var, 'A0'x);
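In a data step, that looks like this (assuming a dataset and variable named as in the question):

data metadata;
    set metadata;
    /* 'A0'x is the hexadecimal literal for the non-breaking space */
    DOSE2 = compress(DOSE2, 'A0'x);
run;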
I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
When you set this up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTRIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall this helps you avoid assigning different numbers to values in your original variable that are written differently but mean the same thing and should get the same number.
I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within the string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANWRD and TRANSLATE will not work, as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little more clear on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
  format Ref best1. pathogen $40.;
  input Ref pathogen $40.;
  datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
  SET dataset1;
  /* blank out any value that mentions "growth", case-insensitive */
  IF index(lowcase(pathogen), "growth") THEN pathogen = "";
RUN;
I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option (a short example follows these notes). Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
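For example, a minimal sketch under the assumptions above (the echoed variable name sits in observation 1, and commas mark the decimals):

drop in 1                                         // discard the echoed variable name
destring gdppercap, generate(gdppercap_n) dpcomma // commas treated as decimal points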
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}
I'm wondering if there is a way to convert an .xlsx file into .csv while preserving everything in its entirety.
I have a column that for some rows has values like 0738794E5, and when I convert it through "save as", the value turns into 7.39E+10. I understand that some values which contain an "E" will be turned into the latter format, but this conversion is of no use to me since that "E" doesn't stand for exponentiation.
Is there a setting to preserve the values the way they are i.e. text/string?
One option is to create an additional (or replacement) column that has the target values either enclosed in double quotes or prepended by an alpha character.
The quotes or alpha character will guarantee that the problem values come in as text. When the csv file is opened, the quotes or alpha will still be there, so you would need to use a string operation (MID or RIGHT, probably) to recover the original string values.
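For example, with the problem values in column A and a helper column in B (cell references are purely illustrative):

Before saving as CSV (helper column):  ="x" & A2
After reopening the CSV:               =MID(B2, 2, LEN(B2)-1)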
My dilemma wasn't real and only appeared to be so.
When I convert the .xlsx into .csv and open the .csv, it shows the improperly-converted values.
However, when I run my application, read from the csv, and output what's been read, I get the values contained within the .xlsx just like I wanted.
I'm not sure how/why this is the way it is, but it works now. (Presumably the CSV on disk keeps the literal text, and it is only Excel's type guessing when displaying the reopened CSV that shows it as a number.)