Converting string to numeric in Stata - string

I have survey data with the age of individuals in a variable named agen. Originally, the variable was string so I converted it to numeric using the encode command. When I tried to generate a new variable hhage referring to the age of head of household, the new variable generated was inconsistent.
The commands I used are the following:
encode agen, gen(age)
gen hhage=age if relntohrp==1
The new variable is not consistent with the original data. When I browsed it, the age of the household head in the first household is 65, but hhage shows 63. In the second household, hhage reports 28 instead of 33 for the household head, and so on.

Run help encode and you can read:
Do not use encode if varname contains numbers that merely happen to be stored
as strings; instead, use generate newvar = real(varname) or destring;
see real() or [D] destring.
For example:
clear all
set more off
input id str5 age
1 "32"
2 "14"
3 "65"
4 "54"
5 "98"
end
list
encode age, gen(age2)
destring age, gen(age3)
list, nolabel
Note the difference between using encode and destring. The former assigns numerical codes (1, 2, 3, ...) to the string values, while destring converts the string value itself to a number. You can see this by stripping the value labels when you list:
. list, nolabel
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 2 |
2. | 2 14 14 1 |
3. | 3 65 65 4 |
4. | 4 54 54 3 |
5. | 5 98 98 5 |
+------------------------+
A simple list or browse may confuse you because encode assigns the sequence of natural numbers but also assigns value labels equal to the original strings:
. list
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 32 |
2. | 2 14 14 14 |
3. | 3 65 65 65 |
4. | 4 54 54 54 |
5. | 5 98 98 98 |
+------------------------+
The nolabel option shows the "underlying" data.
You mention it is inconsistent, but for future questions posting exact input and results is more useful for those trying to help you.
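The same distinction can be reproduced outside Stata. The following is a minimal Python sketch, not Stata code: it assumes that encode numbers the distinct strings in sorted order (which is consistent with the listing above), and the variable names are made up for illustration.

```python
# Ages stored as strings, as in the example above.
ages = ["32", "14", "65", "54", "98"]

# encode-like step: number the distinct strings 1, 2, 3, ... in sorted order.
# The original strings survive only as labels for those codes.
labels = sorted(set(ages))
codes = {text: i for i, text in enumerate(labels, start=1)}
encoded = [codes[a] for a in ages]

# destring-like step: parse each string as the number it spells out.
destrung = [int(a) for a in ages]

print(encoded)   # [2, 1, 4, 3, 5]
print(destrung)  # [32, 14, 65, 54, 98]
```

The first list matches the age2 column shown by list, nolabel, which is why arithmetic or comparisons on an encoded age variable give surprising results.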

Try taking a look at this method? Sounds like you may have slipped up somewhere in your method.

Related

Counting 15's in Cribbage Hand

Background
This is a follow-up to my previous questions "finding a straight in a cribbage hand" and "Counting Pairs in Cribbage Hand".
Objective
Count the number of ways cards can be combined to a total of 15, then score 2 points for each such combination. An Ace is worth 1, and J, Q, K are each worth 10.
What I have Tried
So my first poke at a solution required 26 different formulas. Basically I checked each possible way to combine cards to see if the total was 15. 1 way to add 5 cards, 5 ways to add 4 cards, 10 ways to add 3 cards, and 10 ways to add 2 cards. I thought I had this licked until I realized I was only looking at combinations, I had not considered the fact that I had to cap the value of cards 11, 12, and 13 to 10. I initially tried an array formula something along the lines of:
MIN(MOD(B1:F1-1,13)+1,10)
But the problem with this is that MIN takes the minimum value of all results not the individual results compared to 10.
I then tried it with an IF function, which worked, but it requires a CSE formula even when used with SUMPRODUCT, which is something I try to avoid when I can:
IF(MOD(B1:F1-1,13)+1<11,MOD(B1:F1-1,13)+1,10)
Then I stumbled on an answer to a question in code golf which I modified to get this formula. I kind of like it for some strange reason, but it's a bit long for repetitive use:
--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2)
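As a sanity check on the conversions above, here is a small Python sketch (not Excel) that compares the lookup-string trick with an element-wise cap at 10. It assumes cards are numbered 1 to 52 with the rank given by MOD(card-1,13)+1, as in the formulas; the function names are invented for this example.

```python
LOOKUP = "01020304050607080910101010"  # two characters per rank, ranks 1..13

def rank(card):
    """Rank 1..13 of a card numbered 1..52, i.e. MOD(card-1,13)+1."""
    return (card - 1) % 13 + 1

def value_by_lookup(card):
    """Mirror of --MID(LOOKUP, 1+MOD(card-1,13)*2, 2), using 0-based slicing."""
    start = (rank(card) - 1) * 2
    return int(LOOKUP[start:start + 2])

def value_by_min(card):
    """Cap each rank at 10 individually, one card at a time."""
    return min(rank(card), 10)

hand = [11, 18, 31, 44, 5]
print([value_by_min(c) for c in hand])     # [10, 5, 5, 5, 5]
print([value_by_lookup(c) for c in hand])  # [10, 5, 5, 5, 5]
```

The cap works here because min is applied per card inside the list comprehension. The Excel MIN over a range instead collapses the whole array to one number, which is the problem described above.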
My current working formulas are:
5 card check
=(SUMPRODUCT(--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2))=15)*2
4 card checks
=(SUM(AGGREGATE(15,6,--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2),{1,2,3,4}))=15)*2
=(SUM(AGGREGATE(15,6,--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2),{1,2,3,5}))=15)*2
=(SUM(AGGREGATE(15,6,--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2),{1,2,4,5}))=15)*2
=(SUM(AGGREGATE(15,6,--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2),{1,3,4,5}))=15)*2
=(SUM(AGGREGATE(15,6,--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2),{2,3,4,5}))=15)*2
3 card checks
same as 4 card checks using all combinations for 3 cards in the {1,2,3}.
There are 10 different combinations, so 10 different formulas.
The 2 card check was based on the solution by Tom in Counting Pairs in Cribbage Hand and all two cards are checked with a single formula. (yes it is CSE)
2 card check
{=SUM(--(--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2)+TRANSPOSE(--MID("01020304050607080910101010",1+(MOD(B1:F1-1,13)*2),2))=15))}
Question
Can the 3 and 4 card combination sum check be brought into a single formula similar to the 2 card check?
Is there a better way to convert cards 11,12,13 to a value of 10?
Sample Data
| B | C | D | E | F | POINTS
+----+----+----+----+----+
| 1 | 2 | 3 | 17 | 31 | <= 2 (all 5 add to 15)
| 1 | 2 | 3 | 17 | 32 | <= 2 (Last 4 add to 15)
| 11 | 18 | 31 | 44 | 5 | <= 16 ( 4x(J+5), 4X(5+5+5) )
| 6 | 7 | 8 | 9 | 52 | <= 4 (6+9, 7+8)
| 1 | 3 | 7 | 8 | 52 | <= 2 (7+8)
| 2 | 3 | 7 | 9 | 52 | <= 2 (2+3+K)
| 2 | 4 | 6 | 23 | 52 | <= 0 (nothing add to 15)
Excel Version
Excel 2013
For 5:
=(SUMPRODUCT(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))=15)*2
For 4:
=SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,ROW($1:$5),{1,2,3,4;1,2,3,5;1,2,4,5;1,3,4,5;2,3,4,5}),ROW($1:$4)^0)=15))*2
For 3
=SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,ROW($1:$10),{1,2,3;1,2,4;1,2,5;1,3,4;1,3,5;1,4,5;2,3,4;2,3,5;2,4,5;3,4,5}),ROW($1:$3)^0)=15))*2
For 2:
=SUMPRODUCT(--((CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))+(TRANSPOSE(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)))=15))
All together:
=(SUMPRODUCT(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))=15)*2+
SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,ROW($1:$5),{1,2,3,4;1,2,3,5;1,2,4,5;1,3,4,5;2,3,4,5}),ROW($1:$4)^0)=15))*2+
SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,ROW($1:$10),{1,2,3;1,2,4;1,2,5;1,3,4;1,3,5;1,4,5;2,3,4;2,3,5;2,4,5;3,4,5}),ROW($1:$3)^0)=15))*2+
SUMPRODUCT(--((CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))+(TRANSPOSE(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)))=15))
For older versions we need to "trick" INDEX into accepting the arrays as Row and Column References:
We do that by using N(IF({1},[thearray]))
=(SUMPRODUCT(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))=15)*2+
SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,N(IF({1},ROW($1:$5))),N(IF({1},{1,2,3,4;1,2,3,5;1,2,4,5;1,3,4,5;2,3,4,5}))),ROW($1:$4)^0)=15))*2+
SUMPRODUCT(--(MMULT(INDEX(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)*ROW($1:$10)^0,N(IF({1},ROW($1:$10))),N(IF({1},{1,2,3;1,2,4;1,2,5;1,3,4;1,3,5;1,4,5;2,3,4;2,3,5;2,4,5;3,4,5}))),ROW($1:$3)^0)=15))*2+
SUMPRODUCT(--((CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10))+(TRANSPOSE(CHOOSE(MOD(A1:E1-1,13)+1,1,2,3,4,5,6,7,8,9,10,10,10,10)))=15))
This is an array (CSE) formula: it must be confirmed with Ctrl-Shift-Enter instead of Enter when exiting edit mode.
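To check the formulas against the sample data, the scoring rule can be brute-forced. The sketch below is Python rather than Excel and is only a cross-check under the question's assumptions (cards numbered 1 to 52, rank = MOD(card-1,13)+1, face cards count 10); the function names are hypothetical.

```python
from itertools import combinations

def card_value(card):
    """Value of a card numbered 1..52: rank capped at 10."""
    return min((card - 1) % 13 + 1, 10)

def fifteens_points(hand):
    """2 points for every combination of 2 or more cards that sums to 15."""
    values = [card_value(c) for c in hand]
    ways = sum(
        1
        for size in range(2, len(values) + 1)
        for combo in combinations(values, size)
        if sum(combo) == 15
    )
    return 2 * ways

# Hands from the Sample Data table in the question.
sample_hands = [
    [1, 2, 3, 17, 31],
    [1, 2, 3, 17, 32],
    [11, 18, 31, 44, 5],
    [6, 7, 8, 9, 52],
    [1, 3, 7, 8, 52],
    [2, 3, 7, 9, 52],
    [2, 4, 6, 23, 52],
]
print([fifteens_points(h) for h in sample_hands])  # [2, 2, 16, 4, 2, 2, 0]
```

The results agree with the POINTS column of the sample table, so the combined formula can be compared against these values row by row.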

Adding zeros to a string without generating a new variable

I am trying to add zeros to a string variable in such a way that all levels of the variables have same number of digits (assume 3).
clear
input tina bina str4 pine
1 10 "99"
1 11 "99"
2 11 "99"
2 11 "99"
3 12 "."
4 12 "888"
5 14 "88"
6 15 "777"
7 16 "77"
8 17 "0"
8 18 "7"
end
I managed to do this by generating a new variable which stores the number of digits I need to add to each observation in order to reach 3:
generate pi=3-strlen(pine)
replace pine= ("0"*pi) + pine if strlen(pine)<3
I wonder if there is a way to obtain the same result but without generating the variable?
I tried the following but it does not work :
replace pine= ("0"*(`=3-strlen(pine)')) + pine if strlen(pine)<3
Probably I am not so clear about what happens when I evaluate expressions.
Your approach does not work because it evaluates the expression for the first observation only:
. display `= 3 - strlen(pine)'
1
The single quotes are not required:
replace pine = ("0" * (3-strlen(pine) ) ) + pine if strlen(pine) < 3
+--------------------+
| tina bina pine |
|--------------------|
1. | 1 10 099 |
2. | 1 11 099 |
3. | 2 11 099 |
4. | 2 11 099 |
5. | 3 12 00. |
|--------------------|
6. | 4 12 888 |
7. | 5 14 088 |
8. | 6 15 777 |
9. | 7 16 077 |
10. | 8 17 000 |
|--------------------|
11. | 8 18 007 |
+--------------------+
I know there is already an accepted answer, but I wanted to throw out my suggestion. This is maybe a little bit simpler than the other answer and is straightforward to explain. You just want to replace a string variable of real numbers with leading zeros and keep it as a string. You can easily do this by running:
replace pine = string(real(pine),"%03.0f")
Depending on your goal, this may be better than the previous answer, because it keeps your missing value as missing and does not add zeros to it. Hopefully this is helpful.
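The difference between the two answers can be seen in a short Python sketch. It is an analogy, not Stata: pad_text imitates prepending "0" * (3 - strlen(pine)), and pad_numeric imitates string(real(pine),"%03.0f") under the assumption that anything that does not parse as a number becomes the missing marker ".". Both function names are made up here.

```python
def pad_text(s, width=3):
    """Prepend zeros purely by length, whatever the string contains."""
    return "0" * max(0, width - len(s)) + s

def pad_numeric(s, width=3):
    """Pad only values that parse as numbers; everything else becomes '.'."""
    try:
        number = float(s)
    except ValueError:
        return "."
    return f"{number:0{width}.0f}"

pine = ["99", ".", "888", "88", "0", "7"]
print([pad_text(s) for s in pine])     # ['099', '00.', '888', '088', '000', '007']
print([pad_numeric(s) for s in pine])  # ['099', '.', '888', '088', '000', '007']
```

The only row that differs is the "." entry, which the length-based version turns into "00.".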

Excel Ranking tie assistance

Can anyone help me to do the following in relation to ties when using the Excel Rank function?
Col A contains scores and B contains the rank. I am quite happy with this except that I would like to show an '=' next to the ranking where it is a tie:
Score Rank
66 3
64 4=
63 6
68 2
64 4=
81 1
etc
Many thanks.
You can combine your RANK with COUNTIF. Place the following into cell B3 as per the example:
=RANK(A3,$A$3:$A$7)&IF(COUNTIF($A$3:$A$7,A3)>1,"=","")
Note, if you are using Excel 2013 or 2016, it would be a good idea to replace RANK with RANK.EQ
This can be done in another column next to the column where you have ranked it.
Step 1: Rank the numbers in a simple manner using RANK.EQ
Output:
(A)|(B)
66 | 3
64 | 4
63 | 6
68 | 2
64 | 4
81 | 1
Step 2: In another column use the formula =IF(COUNTIF(A:A,A1)>1,CONCATENATE(B1,"="),B1)
Output:
(A)|(B)|(C)
66 | 3 | 3
64 | 4 | 4=
63 | 6 | 6
68 | 2 | 2
64 | 4 | 4=
81 | 1 | 1
You can paste the values and remove the columns as required.
Hope it helps. :)
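The logic shared by both answers (rank in descending order, then append "=" when the score occurs more than once) can be sketched in Python for checking expected output. This is an illustration only; rank_with_ties is a hypothetical helper, and the rank is computed as one plus the number of strictly higher scores, which is how tied values end up with the same rank.

```python
def rank_with_ties(scores):
    """Descending rank per score, with '=' appended when the score is tied."""
    labels = []
    for s in scores:
        rank = 1 + sum(1 for other in scores if other > s)  # like RANK / RANK.EQ
        tied = scores.count(s) > 1                          # like COUNTIF(...) > 1
        labels.append(f"{rank}=" if tied else str(rank))
    return labels

scores = [66, 64, 63, 68, 64, 81]
print(rank_with_ties(scores))  # ['3', '4=', '6', '2', '4=', '1']
```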

Listing nonmissing string variables in Stata

I want to display (list) the value of a string variable DE15_WHY in Stata only when it is not missing (e.g. some subjects did not provide comments). I thought this would be easy:
list DE15_WHY if DE15_WHY != ""
This displays DE15_WHY for all subjects even if they do not have anything in DE15_WHY...
Is the string formatted wrongly? For example, does Stata think that all subjects have a valid observation for DE15_WHY? How do I fix this? I checked, and it is formatted as a string variable.
Stata also allows me to tabulate DE15_WHY, similar to R. This is a great option but does not display the entire contents of the string variable in the table. How do I get Stata to display the entire string?
@Metrics' answer has several good details, but I will add more here.
With string variables, Stata has only one definition of missing, namely that a string is empty, and contains precisely no characters.
One or more spaces, despite usually conveying nothing to people, do not qualify as missing so far as Stata is concerned.
The term "blank" is perhaps unclear here and thus better avoided.
If spaces somehow get into your string variables a condition such as
if trim(mystring) == ""
selects values that are empty or that have spaces and correspondingly a condition such as
if trim(mystring) != ""
selects values with other content. To replace spaces with empty strings, we thus go
replace mystring = "" if trim(mystring) == ""
In general, if you have rather long strings, Stata necessarily has a problem of where to display them. One tip is that list will show more than tabulate. If you want a tabulate and list hybrid, check out groups from SSC, using ssc inst groups.
Although the period . is the default or system missing value for numeric variables (or numeric scalars or matrix elements) in Stata, it does not attach any special meaning to the string ".".
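The three cases discussed above (empty string, spaces only, and the literal ".") can be illustrated with a small Python sketch. It is an analogy rather than Stata code: strip(" ") is used as a stand-in for trim(), and the sample values are invented.

```python
comments = ["ab", "cd", "", "   ", "."]

# Comparing against "" alone still keeps the value that holds only spaces.
not_empty = [s for s in comments if s != ""]

# Stripping spaces first treats "" and "   " alike, as trim(mystring) != "" does.
has_content = [s for s in comments if s.strip(" ") != ""]

print(not_empty)    # ['ab', 'cd', '   ', '.']
print(has_content)  # ['ab', 'cd', '.']
```

The "." survives both filters because, for a string, it is ordinary content and needs its own condition if it was used to mark missing values.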
sysuse auto
list rep78 in 1/10 if rep78 != . // non-missing values only
tab rep78 // by default, only non-missing values are reported
tab rep78, missing // include missing values as well
If variable is a string with missing indicated by .
list yourvariable if yourvariable !="."
If variable is a string with missing indicated by blank
list yourvariable if yourvariable !=""
Example:
my    my1
ab    1
cd    2
      3
ef    4
list my if my !=""
+----+
| my |
|----|
1. | ab |
2. | cd |
4. | ef |
+----+
tab leaves out missing values by default: the empty string for string variables and . for numeric variables.
tab my
my | Freq. Percent Cum.
------------+-----------------------------------
ab | 1 33.33 33.33
cd | 1 33.33 66.67
ef | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
tab my,missing
my | Freq. Percent Cum.
------------+-----------------------------------
| 1 25.00 25.00
ab | 1 25.00 50.00
cd | 1 25.00 75.00
ef | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00

Multiple-digit numbers getting split by space in NuGram?

I'm seeing some unexpected behavior in the NuGram IDE Eclipse plug-in for ABNF grammar development.
Say I have a rule that reads:
$fifties =
50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59
;
The sentence generator comes up with the matches 5 0, 5 1, 5 2, ... I would normally expect 50, 51, 52, and so forth, but according to NuGram's coverage tool these are considered OOG.
Come to find that it will split any multiple-digit number with spaces, unless there's a leading non-number:
1234 -> 1 2 3 4
1234asdf -> 1 2 3 4 asdf
asdf1234 -> asdf1234
1234asdf5678 -> 1 2 3 4 asdf5678
As far as I know, a normal ABNF grammar wouldn't do this. Or am I forgetting something?
This is because NuGram IDE considers digits as individual DTMF tones. I agree that this behaviour should only apply to DTMF grammars and not voice grammars.
You can surround sequences of digits with double quotes, like:
$fifties =
"50" | "51" | "52" | "53" | "54" | "55" | "56" | "57" | "58" | "59"
;
Hope that helps!
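The splitting reported in the question can be modelled in a few lines of Python. This is only a sketch that reproduces the four observed cases (leading digits become separate tokens, the remainder stays together); it is not NuGram's actual tokenizer, and split_leading_digits is a made-up name.

```python
def split_leading_digits(token):
    """Split each leading digit into its own token; keep the rest as one token."""
    i = 0
    while i < len(token) and token[i].isdigit():
        i += 1
    parts = list(token[:i])      # one entry per leading digit
    if token[i:]:
        parts.append(token[i:])  # remainder, starting at the first non-digit
    return " ".join(parts)

for example in ["1234", "1234asdf", "asdf1234", "1234asdf5678"]:
    print(example, "->", split_leading_digits(example))
```

Under this model an unquoted 50 becomes the two tokens 5 and 0, which is consistent with the generator output described in the question and with quoting as the workaround.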

Resources