reading data from excel into SAS and converting character to numeric - excel

I am reading data from an excel file in SAS and inserting the values to an oracle table. The oracle table has a numeric column. If the excel file has numbers, it works fine. But if the column is left blank in the excel file, it is read as a character value and insertion to oracle fails.
Is it possible to convert the column to numeric if it is blank, but read it as is if it has a number?
Thanks!

Let's assume that SAS is reading this column as a character and you cannot convert it directly within the file. This happens sometimes: maybe you don't have authorization to do it, or maybe it's just not working like you're expecting. SAS can go from character to numeric and numeric to character with two functions: input() and put().
Going from Character to Numeric: input()
input() is for changing character data into numbers.
This is great for reading in dates, currency, comma-separated numbers, etc. If you need your data as a number, use this function. Its syntax is:
num_var = input(char_var, informat.);
In your case, let's say we always expect numbers to be here even if it's missing. We'll use the 8. informat for our variable of interest, my_var.
data want;
set have_excel(rename=(my_var = my_var_char) );
my_var = input(my_var_char, 8.);
drop my_var_char;
run;
Note that we need to create a new variable. We rename the variable of interest to something else, then create a new version of the variable of interest that is a number. In SAS, just like many other languages and database systems, when a variable is declared as a character or number, it is always a character or a number.
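For readers who want to see the logic outside SAS, here is a minimal Python sketch of what the conversion is meant to achieve: numeric text becomes a number, and a blank cell becomes a missing value instead of a character value. The function name to_numeric and the sample values are hypothetical, and None stands in for a SAS missing value; this is an illustration of the idea, not of how SAS implements input().

```python
def to_numeric(text):
    """Return a float for numeric text, or None for a blank cell."""
    stripped = text.strip()
    if stripped == "":
        return None  # blank cell: treat as a missing numeric value
    return float(stripped)

# Hypothetical column values as they might arrive from a spreadsheet
values = ["12", " 7.5 ", "", "   "]
converted = [to_numeric(v) for v in values]
print(converted)
```

Every element of the result is either a number or missing, which is the shape a numeric database column expects.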
Going from Numeric to Character: put()
put() is for putting a number to a character or a character to another character.
This is great for converting SAS dates to characters, adding custom formats, converting a character to another character, etc. The syntax is:
char_var = put(num_var, format.);
OR:
char_var = put(char_var, format.);
Note the previous use case: with put(), you can convert characters to other characters. This is very handy for standardizing values or even merging data using a format.
For example: let's convert a number to a comma-separated character number.
data want;
char_number = put(1234, comma.);
run;
Output:
char_number
1,234

Below case statement worked for me.
case
when missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
when not missing(input(cats(COLUMN_VALUE), best8.)) THEN input(cats(COLUMN_VALUE), best8.)
end as COLUMN_VALUE

Related

How to convert a string containing non-numeric values into numeric values?

I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.
If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.
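To make the decimal-comma point concrete outside Stata, here is a small Python sketch of the same idea: the text is only unparseable because the comma is the decimal mark, so swapping it for a period is enough. The helper name parse_decimal_comma is hypothetical, and the sketch assumes the values contain no thousands separators.

```python
def parse_decimal_comma(text):
    """Parse a number written with a decimal comma, e.g. '19786,97'."""
    # Assumption: the comma is always the decimal mark, never a thousands separator
    return float(text.strip().replace(",", "."))

raw = ["19786,97", "20713,737", "5639,175"]
values = [parse_decimal_comma(s) for s in raw]
print(values)
```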
You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
* if the character at position x is a comma, replace it with a period
replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}

Excel text formatting is not applied in OfficeWriter reports

I'm using OfficeWriter reports to export data to Excel. I have a reference number field with a value such as 00033444. I have set the cell formatting to "text", but it is still displayed without the leading zeros, as 33444.
I'm using Office Writer 8.4 version.
Any help plz?
OfficeWriter's ExcelTemplate approach will always attempt to convert numerical strings to numbers and there are several options to ensure that your numerical strings are preserved:
Option 1: In your code, set ExcelTemplate.PreserveStrings to TRUE. This will import all numerical strings as strings.
Option 2: In your template file, add the 'Preserve' modifier to the data marker that corresponds to the reference number field. For example, %%=DataSet.ReferenceField(Preserve). This will import numerical strings from that column of data (i.e. ReferenceField) as strings instead of numbers.
You can set the number formatting of the cell that contains the data marker to text, but that is not necessary to preserve numerical strings. If you use one of the options above, the numerical strings will be imported as strings, regardless of the number format in the template.

Loading dataset containing both strings and number

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
Problem is it contains both integers and strings.
I found some information on how to get out the integers only.
But haven't been able to see if there's any way to get all the data.
Is that possible?
If it is not possible, is there any way to find the numbers on each line and throw everything else away without having to choose the columns?
I need this specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'Uniform', false);
The result here is a cell array of cell array of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'Uniform', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
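The same two steps (pull out the runs of digits on each line, then convert them to numbers) can be sketched in Python for comparison. This is only an illustration of the technique, not a MATLAB replacement; the variable names are hypothetical, and the sample lines are the first two rows from the question. Like the pattern above, \d+ matches unsigned integers only and would also pick up digits embedded in a word.

```python
import re

lines = [
    "Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green",
    "Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red",
]

# One list of integers per line; every non-digit field is discarded
numbers = [[int(token) for token in re.findall(r"\d+", line)] for line in lines]

print(numbers[1][:4])
```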
Good luck!

Reading mix between numeric and non-numeric data from excel into Matlab

I have a matrix where the first column contains dates and the first row contains maturities which are alpha/numeric (e.g. 16year).
The rest of the cells contain the rates for each day, which are double precision numbers.
Now I believe xlsread() can only handle numeric data so I think I will need something else or a combination of functions?
I would like to be able to read the table from excel into MATLAB as one array or perhaps a struct() so that I can keep all the data together.
The other problem is that some of the rates are given as '#N/A'. I want the cells where these values are stored to be kept but would like to change the value to blank=" ".
What is the best way to do this? Can it be done as part of the input process?
Well, from looking at matlab reference for xlsread you can use the format
[num,txt,raw] = xlsread(FILENAME)
and then you will have in num a matrix of your data, in txt the unreadable data, i.e. your text headers, and in raw you will have all of your data unprocessed. (including the text headers).
So I guess you could use the raw array, or a combination of the num and txt.
For your other problem, if your rates are 'pulled' from some other source, you can use
=IFERROR(RATE DATA,"")
and then there will be a blank instead of the error code #N/A.
Another solution (only for Windows) would be to use xlsread() format which allows running a function on your imported data,
[num,txt,raw,custom] = xlsread(filename,sheet,xlRange,'',functionHandler)
and let the function replace the NaN values with blank spots. (and you will have your output in the custom array)
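The replacement step itself is simple once the unprocessed cells are in memory. As a language-neutral illustration, here is a Python sketch that keeps every cell in place and swaps only the '#N/A' entries for a blank; the table contents are made-up sample data, and the nested list merely stands in for the raw cell array.

```python
# Made-up sample standing in for the raw cells: header row, then one row per date
raw = [
    ["Date", "1year", "16year"],
    ["2012-01-02", 0.5, "#N/A"],
    ["2012-01-03", "#N/A", 2.25],
]

# Keep the table shape; replace only the error marker with a blank
cleaned = [[" " if cell == "#N/A" else cell for cell in row] for row in raw]

print(cleaned[1])
```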

Comma separators in Fortran

I have come across the following issue with Fortran: that in reading a character array, for example, or any list in actuality, from a data file with fmt=*, both non-interquote blanks AND commas are natively considered as delimiters for the elements in the array/list. The fact that commas act as delimiters is a big problem for me.
So the question is: do you know of any option or compiler directive in Fortran that makes commas in input files be treated as ordinary characters rather than delimiters, with blanks as the only delimiters? As a specific example, I would like that when reading a record like:
x,y,z
with:
read (7,*) adummy
would result in adummy (a scalar character variable) getting the value x,y,z not x.
Any help would be most welcome.
The solution is to specify formatting to match your data record, i.e. use character data descriptor when specifying the format:
read(7,fmt='(A)')adummy
will result in adummy having value x,y,z, assuming it is a variable of sufficient length.
However this method will not treat blanks as delimiters either, so if you want to read commas as character strings but have blanks as delimiter, the common way to achieve this is to read the whole record into the character variable and do the splitting into separate variables afterwards.
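The read-then-split approach described above can be illustrated in a few lines of Python: take the whole record as one string, then split on blanks only, so that commas stay inside the fields. The sample record is hypothetical, and this is a sketch of the idea, not Fortran code.

```python
# Hypothetical record: fields are separated by blanks, commas are ordinary characters
record = "x,y,z  a,b   c"

# str.split() with no argument splits on runs of whitespace only
fields = record.split()

print(fields)
```

In Fortran the equivalent would be to read the record with the (A) descriptor and then scan the character variable for blanks.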
