How to convert a string containing non-numeric values into numeric values?

I have several variables of the form:
1 gdppercap
2 19786,97
3 20713,737
4 20793,163
5 23070,398
6 5639,175
I have copy-pasted the data into Stata, and it thinks they are strings. So far I have tried:
destring gdppercap, generate(gdppercap_n)
but get
gdppercap contains nonnumeric characters; no generate
And:
encode gdppercap, gen(gdppercap_n)
but get a variable numbered from 1 to 1055 regardless of the previous value.
Also I've tried:
gen gdppercap_n = real(gdppercap)
But get:
(1052 missing values generated)
Can you help me? As far as I can tell, Stata does not like the fact that the variable contains fractional numbers.

If I understand you correctly, the interpretation as string arises from one and possibly two facts:
The variable name may be echoed in the first observation. If so, that's text and it's inconsistent with a numeric variable. The root problem there is likely to be a copy-and-paste operation that copied too much. Stata typically gives you a choice when importing by copy-and-paste of whether the first row of what you copied is to be treated as variable names or as data, and you need the first choice, so that column headers become variable names, not data. It may be best to go back and do the copy-and-paste correctly. However, Stata can struggle with multiple header lines in a spreadsheet. Alternatively, use import excel, not a copy-and-paste. Alternatively, drop in 1 to remove the first observation, provided that it consistently is superfluous.
Commas indicate decimal places. destring can easily cope with this: see the help for its dpcomma option. Stata has no objection to fractions; that would be absurd. The problem is that you need to flag your use of commas.
Note that
destring is a wrapper for real(), so real() is not a way round this.
encode is for mapping genuine categorical variables to integers, as you discovered, and as its help does explain. It is not for fixing data input errors.

You can write a for loop to convert a comma to a period. I don't quite know your variables but imagine you have a variable gdppercap with information like 1234,343 and you want that to be 1234.343 before you do the destring.
For example:
forvalues x = 1(1)10 {
    * the if qualifier must sit on the same line as the replace command
    replace gdppercap = substr(gdppercap, 1, `x'-1) + "." + substr(gdppercap, `x'+1, .) if substr(gdppercap, `x', 1) == ","
}
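For anyone checking the same column outside Stata, the comma-to-period idea can be sketched in Python. This is only an illustration under assumptions, not part of the Stata workflow: the helper name parse_decimal_comma is hypothetical, and cells that are not numbers (such as an echoed header) are mapped to None.

```python
from typing import Optional


def parse_decimal_comma(text: str) -> Optional[float]:
    """Parse a number written with a decimal comma; return None if it is not numeric."""
    try:
        # Swap the decimal comma for a period before converting.
        return float(text.strip().replace(",", "."))
    except ValueError:
        # Text such as a pasted column header cannot be converted.
        return None


# Sample cells taken from the question, including the echoed variable name.
values = ["gdppercap", "19786,97", "20713,737", "5639,175"]
parsed = [parse_decimal_comma(v) for v in values]
```

The header row comes back as None, which mirrors why a single text cell forces the whole column to be read as a string.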

Related

How can I separate consecutive strings without any delimiters?

My input data is a VCF (Variant Call Format) file. Each line that I am interested in looks like this:
chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111
I want to count the presence (1) of a specific deletion at a specific position (22232) supported by n samples. For this reason, I looked at the SUPP_VEC= values; however, I don't know how to split each value because 1) it is a string, and 2) it doesn't have delimiters. How could I add a space between every character? Or how could I split/count the values from SUPP_VEC= in Python3?
I was also curious to know what SUPP means. I found one SUPP=2 and checked in Excel whether the presence (1)/absence (0) flags in SUPP_VEC added up to the value of SUPP; however, I could only count 1 instead of 2. Perhaps somebody knows what SUPP means.
The reason for my procedure is to have a frequency table for a specific deletion type.
I hope I made myself clear.
Thank you in advance.
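One possible approach, sketched in Python 3: a string can be iterated character by character, so no delimiter is needed. The function name parse_supp_vec is hypothetical, and the sketch assumes the INFO field is the last whitespace-separated column, as in the excerpt above.

```python
from typing import List


def parse_supp_vec(info: str) -> List[int]:
    """Return the SUPP_VEC flags from a VCF INFO string as a list of 0/1 integers."""
    # INFO is a semicolon-separated list of KEY=VALUE pairs.
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # Iterating over a string yields one character at a time.
    return [int(ch) for ch in fields["SUPP_VEC"]]


line = "chrI 22232 DEL00BED N <DEL> . PASS SUPP=1159;SUPP_VEC=11111111111111111111111011111111111"
info = line.split()[-1]   # assumption: INFO is the last column in this excerpt
flags = parse_supp_vec(info)
n_supporting = sum(flags)  # number of positions flagged with 1
```

Summing the flags gives the number of 1s per line, which can then be tallied across lines to build the frequency table.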

Comparing values with some alphanumerics

I've looked through the forums but couldn't find any questions (with answers) that helped. Any guidance would be appreciated.
I'm working on an Excel/Access project that cross references error codes. The codes are twelve digits long, with the first half and second half that need to be sortable. 99% of these codes are entirely numeric, but the 1% that includes letters is really screwing me up.
For example, a common error code might be "386748000123". This would be split into "386748" and "000123", with the first being the code for the type of system and the second being the type of error.
But then the 1% are something like this: "0957AB003A41". "0957AB", and "003A41".
If I format the columns (in Excel and Access) as numbers, then the numeric comparisons are far easier: "000123" equals "123". If I format the columns as strings, then I can compare the alphanumeric values, but "000123" and "123" no longer match.
The possible solution I've come across is utilizing the Val function inside an Access query to purely compare values but I've never used it and it seems like only a partial fix. Val ignores the strings, which means "0957AB" will have the same value as "0957XY" - and that doesn't work for this project.
I'm sure many of you have had similar issues, so I'm hoping to get some ideas on different ways the problem has been approached and resolved.
You have not provided a minimal sample of the data or the expected output, and there is no code for me to amend. However, the only part you are having trouble with is comparing the alphanumeric codes, so you should format all of your data as strings and then compare. To make 123 equal to "000123", format the numeric values as strings, as below:
format(123,"000000")
which will give you "000123"
Edit
From your comment I learned that the problem is that the key is always, or often, a number. Format will return the proper string for comparison; if the value is already a 6-character string, it is returned unchanged, so there would not be a problem.
Do something like this:
if format(key,"000000")=format(code,"000000") then
'do something
end if
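The same pad-then-compare idea can be sketched in Python for illustration. The helper normalize_code is hypothetical, and uppercasing the letters is an added assumption (it treats "3a41" and "3A41" as the same code), not something stated in the question.

```python
def normalize_code(value, width: int = 6) -> str:
    """Turn a numeric or text code into a fixed-width, zero-padded string."""
    # str() handles both numbers (123) and text ("0957AB").
    # Uppercasing is an assumption so that letter case does not affect matching.
    return str(value).strip().upper().zfill(width)


# Numeric and text forms of the same code now compare equal...
same = normalize_code(123) == normalize_code("000123")
# ...while codes that differ only in their letters stay distinct.
different = normalize_code("0957AB") != normalize_code("0957XY")
```

Because every value is compared as a string, the letters are kept, which avoids the Val problem described in the question.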

SPSS converting a string into a numeric variable issue

I have a string variable with lots of parentheses and other punctuation e.g. _LSC Debt licensed work. How can I easily convert it to a numeric variable when I already have a specified code list for it? i.e. I don't want it to automatically recode everything because it uses the wrong values against the labels.
Create a dataset with two variables: a string holding the current messy name and a numeric variable holding the new code. Then, with both the original dataset and the lookup one sorted by the string, do MATCH FILES specifying a table match (or use Data > Merge Files > Add Variables).
You can prepare a separate file which includes two variables:
- one contains each of the possible values in the original string variable to be recoded (make sure the name and width are the same as your original variable)
- the second contains the new values you want to recode to.
Once this is set up, match the files like this:
get file="filepath\Your_Value_Table.sav".
sort cases by YourOriginalVarName.
dataset name ValTab.
get file="filepath\Your_Original_File.sav".
sort cases by YourOriginalVarName.
match files /file=* /table=ValTab /by YourOriginalVarName.
exe.
At this point your original file will contain a new variable that has the codes you wanted.
In general I agree with the solution provided by others. However, I would like to suggest an extra step, which could make your look-up file (see the answer of eli-k and JKP) a bit better.
The point is that your string variable with lots of parentheses and other punctuation probably also has different ways to write the same thing.
For example:
_LSC Debt licensed work
LSC Debt licensed work
_LSC Debt Licensed Work
etc.
You could create a lookup-table with three variables: the unique values of the original string variable, a cleaned-up version of that variable, and finally the numeric value you want to attach.
The advantage of the cleaned-up version is that you can identify more easily the same value although it is written differently.
You could clean up using several functions:
string CleanedUpVersion (A40).
compute CleanedUpVersion = REPLACE(RTRIM(LTRIM(UPCASE(YourOriginalVarName))),'_','').
execute.
In this basic example we convert to capital letters, delete leading and trailing blanks and remove the underscore by replacing it by nothing.
Overall this could help to avoid giving different numbers to unique values in your original variable that mean the same thing, while you would like them to have the same number.
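To illustrate the cleaned-up lookup idea outside SPSS, here is a small Python sketch. The code value 1 and the names clean_label and code_list are hypothetical; only the three spelling variants come from the example above.

```python
def clean_label(raw: str) -> str:
    """Uppercase, trim blanks, and drop underscores, mirroring the SPSS clean-up."""
    return raw.strip().upper().replace("_", "")


# Hypothetical code list keyed by the cleaned-up label.
code_list = {"LSC DEBT LICENSED WORK": 1}

raw_values = [
    "_LSC Debt licensed work",
    "LSC Debt licensed work ",
    "_LSC Debt Licensed Work",
    "Something else",  # hypothetical value with no entry in the code list
]

# Values missing from the code list map to None so they can be reviewed by hand.
codes = [code_list.get(clean_label(v)) for v in raw_values]
```

All three spellings receive the same code, and the unmatched value is flagged rather than silently assigned a number.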

SAS: Match single word within string values of a single variable then replace entire string value with a blank

I'm working in SAS 9.2, in an existing dataset. I need a simple way to match a single word within string values of a single variable, and then replace the entire string value with a blank. I don't have experience with SQL, macros, etc., and I'm hoping for a way to do this (even if the code is less efficient) that will be clear to a novice.
Specifically, I need to remove the entire string containing the word "growth" in a variable "pathogen." Sample values include "No growth during two days", "no growth," "growth did not occur," etc. I cannot enter all possible strings since I don't yet know how they will vary (we have only entered a few observations so far).
TRANSWD and TRANSLATE will not work as they will not allow me to replace an entire phrase when the target word is only a part of the string.
Other methods I've looked at (for example, a SESUG paper using PRX at http://analytics.ncsu.edu/sesug/2007/CC06.pdf) appear to remove all instances of the target string in every variable in the dataset, instead of just in the variable of interest.
Obviously I could subset the dataset to a single variable before I perform one of these actions and then merge back, but I'm hoping for something less complicated. Although I will certainly give something more complicated a shot if someone can provide me with sample code to adapt (and it would be greatly appreciated).
Thanks in advance--Kim
Could you be a little clearer on how the data set is constructed? I think mjsqu's solution will work if your variable pathogen is stored sentence by sentence. If not, then I would say your best bet is to parse the blocks into sentences and then apply mjsqu's solution.
DATA dataset1;
format Ref best1.
pathogen $40.;
input Ref pathogen $40. ;
datalines;
1 No growth during two days
2 no growth,
3 growth did not occur,
4 does not have the word
;
RUN;
DATA dataout;
SET dataset1;
IF index(lowcase(pathogen),"growth") THEN pathogen="";
RUN;
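For comparison only, the logic of that data step (a case-insensitive search in one variable, then blanking the whole value) can be sketched in Python. The variable names are hypothetical, and the sample values are the ones from the datalines above.

```python
pathogens = [
    "No growth during two days",
    "no growth,",
    "growth did not occur,",
    "does not have the word",
]

# Blank the entire value when "growth" appears anywhere in it, ignoring case.
# Only this one list (the single variable of interest) is touched.
cleaned = ["" if "growth" in value.lower() else value for value in pathogens]
```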

Loading dataset containing both strings and number

I'm trying to load the following dataset:
Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
...
Problem is it contains both integers and strings.
I found some information on how to get out the integers only.
But haven't been able to see if there's any way to get all the data.
Is that possible?
If it is not possible, is there any way to find the numbers on each line and throw everything else away without having to choose the columns?
I ask specifically because it seems I cannot use str2num on a whole line at a time.
Almost anything is possible, you just have to define your goal accurately.
Assuming that your database is stored as a text file, you can parse it line by line using textread, and then apply regexp to filter only the numerical fields (this does not require having prior knowledge about the columns):
C = textread('database.txt', '%s', 'delimiter', '\n');
C = cellfun(@(x)regexp(x, '\d+', 'match'), C, 'UniformOutput', false);
The result here is a cell array of cell arrays of strings, where each string corresponds to a numerical field in a specific line.
Since the numbers are still stored as strings, you'd probably need to convert them to actual numerical values. There's a multitude of ways to do that, but you can use str2num in a tricky way: it can convert delimited strings into an array of numbers. This means that if you concatenate all strings in a specific line back into one string, and put spaces in between, you can apply str2num on all of them at once, like so:
C = cellfun(@(x)str2num(sprintf('%s ', x{:})), C, 'UniformOutput', false);
The resulting C is a cell array of vectors, each vector containing the values of all numerical fields in the corresponding line. To access a specific vector, you can use curly braces ({}). For instance, to access the numbers of the second line, you would use C{2}.
All the non-numerical fields are discarded in the process of parsing, of course. If you want to keep them as well, you should use a different regular expression with regexp.
Good luck!
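The same two options (numbers only, or all fields) can be sketched in Python for illustration. The lines below are shortened, hypothetical versions of the rows in the question, and the helper parse_line is not part of the MATLAB answer.

```python
import re

# Shortened sample rows; the real rows have many more columns.
lines = [
    "Afghanistan,5,1,648,green",
    "Albania,3,1,29,red",
]

# Option 1: keep only the numeric fields, like the regexp approach above.
numbers = [[int(tok) for tok in re.findall(r"\d+", line)] for line in lines]


def parse_line(line: str) -> list:
    """Option 2: keep every field, converting the all-digit ones to integers."""
    return [int(f) if f.isdecimal() else f for f in line.split(",")]


records = [parse_line(line) for line in lines]
```

Note that the digit pattern would also pick up digits embedded in a text field, so the second option is safer when names may contain numbers.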
