I am trying to convert a column in Stata (AP1) in the photo below so that the entries are string types. Currently the entries appear as characters ("cultivateur" for example) but are shown as being of the type int.
I used the following code to try and change them to strings.
label values AP1 .
tostring AP1, replace
AP1 was long now str5
While the AP1 column becomes a string type, all the characters now become integers, which is not what I need in order to subset the data based on observations.
Does anyone know how I can switch this column to the string type without the characters becoming integers?
Your images are barely readable -- on a phone or laptop -- but perhaps readable by anyone using a very large monitor. Please see the Stata tag wiki for guidance on presenting reproducible data examples.
What is going on can be explained reproducibly by considering
. sysuse auto, clear
(1978 automobile data)
. tab foreign
Car origin | Freq. Percent Cum.
------------+-----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. tab foreign, nolabel
Car origin | Freq. Percent Cum.
------------+-----------------------------------
0 | 52 70.27 70.27
1 | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. tostring foreign, gen(str_foreign)
str_foreign generated as str1
. tab str_foreign
Car origin | Freq. Percent Cum.
------------+-----------------------------------
0 | 52 70.27 70.27
1 | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. d *foreign
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------
foreign byte %8.0g origin Car origin
str_foreign str1 %9s Car origin
foreign, like your problematic variable, is a numeric variable with value labels. (The term "column" is not standard in Stata for variables in the dataset.) Push it through tostring and you get a string variable containing integer characters. Stata did what you asked.
To get a string variable containing the text of the value labels, you need to apply the decode command, which was written for precisely this purpose (and incidentally, long predates tostring as an official command).
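A minimal sketch of decode on the same auto data follows. The new variable name origin_text is hypothetical. For the variable in the question, the analogous call would presumably be decode AP1, gen(AP1_text), assuming the value labels are still attached; since label values AP1 . detaches them, the data would need to be reloaded first.

```stata
* Reproducible sketch; origin_text is a hypothetical variable name
sysuse auto, clear

* decode copies the value-label text into a new string variable
decode foreign, gen(origin_text)

* The text can now be used directly to subset observations
count if origin_text == "Foreign"
list make origin_text if origin_text == "Foreign"
```

The original numeric variable is left untouched, so both representations remain available.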
So, the title might be confusing, so I'll outline like this:
I am making a weight-loss chart. One of the clients gets to open a bag of Legos as a reward for every 2 lb that he loses, as long as he does it based on a goal progression. For instance, if he weighs 260 and loses 2 lb, he gets his reward. However, if he gains 1 lb, he now has to lose 3 lb to get his reward.
Currently, I have charts that look like this:
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 263 | 8 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 261 |
| Lbs until next bag | 2 |
After he hits 261, I want that cell that says 261 in Col M to say "259". So if he weighs in again, I want it to look like this automatically.
| Column O | Column P |
| -------- | -------- |
| Current Weight | Amount Lost |
| 260.5 | 10.5 |

| Column L | Column M |
| -------- | -------- |
| Next Lego Bag | 259 |
| Lbs until next bag | 1.5 |
What is the best way to automatically make that cell in Column M change when he hits the 2 lb goal? I have a table that states all the goal weights he needs to hit for each reward. It looks like this:
| Column Z | Column AA | Column AB |
| -------- | -------- | -------- |
| Bag | Target Weight | Amount Lost |
| Bag 5 | 261 | 8 |
| Bag 6 | 259 | 10 |
| Bag 7 | 257 | 12 |
| Bag 8 | 255 | 14 |
| Bag 9 | 253 | 16 |
etc
I've tried a few things, but I'm coming up blank, because it won't always be in whole numbers the amount he loses, so matching it to the target weight has been tough.
In really, really simple terms, I need it to basically say this:
If current weight > goal 1, then A1 = goal 1. If current weight < Goal 1, then A1 = Goal 2, and all the way to Goal 21. However, A1 can't change to the next goal until current weight is less than that goal.
Thanks all
I have tried IF statements and Floor statements to get an ongoing changing thing, but it's not working.
In M2: =IF(MOD(O2+1,2)=0,2,MOD(O2+1,2))
In M1:
=O2-M2
Or using O365 in M1:
=LET(m,MOD(O2+1,2),
lbs,IF(m=0,2,m),
VSTACK(O2-lbs,lbs))
I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and I have a finite yet unknown amount of strings that I cannot reliably predict what they are.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
I am using tabstat in Stata, and using estpost and esttab to get its output to LaTeX. I have tabstat display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that those categories that have higher means will be on top. By default, Stata displays by alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, work at the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
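The collapse strategy mentioned above can be sketched as follows, reusing the same example. The variable name mean_mpg is carried over from the earlier steps; note that collapse replaces the data in memory, so this assumes the original dataset is saved or can be reloaded.

```stata
* Sketch: reduce to one observation per group, then sort as desired
sysuse auto, clear
gen Make = word(make, 1)
keep if foreign

* One row per Make, holding the group mean
collapse (mean) mean_mpg = mpg, by(Make)

* Highest mean first; ties broken alphabetically on Make
gsort -mean_mpg Make
list, noobs
```

The resulting dataset can then be exported or listed in whatever order is wanted, without any need for value labels.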
UPDATE 3 and 5 October 2021: A new helper command myaxis from SSC and the Stata Journal (see the accompanying paper) condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing in Stata ssc install egenmore. In particular, I would look at the entry for axis() in the helpfile of egenmore. That contains an example that does exactly what you want.
I have survey data with the age of individuals in a variable named agen. Originally, the variable was string so I converted it to numeric using the encode command. When I tried to generate a new variable hhage referring to the age of head of household, the new variable generated was inconsistent.
The commands I used are the following:
encode agen, gen(age)
gen hhage=age if relntohrp==1
The new variable is not consistent. When I browsed it, the age of the household head in the first household is 65, while the new variable showed 63. In the second household, hhage reported 28 instead of 33 for the household head. And so on.
Run help encode and you can read:
Do not use encode if varname contains numbers that merely happen to be stored
as strings; instead, use generate newvar = real(varname) or destring;
see real() or [D] destring.
For example:
clear all
set more off
input id str5 age
1 "32"
2 "14"
3 "65"
4 "54"
5 "98"
end
list
encode age, gen(age2)
destring age, gen(age3)
list, nolabel
Note the difference between using encode and destring. The former assigns numerical codes (1, 2, 3, ...) to the string values, while destring converts the string value to numeric. You can see this by stripping the value labels when you list:
. list, nolabel
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 2 |
2. | 2 14 14 1 |
3. | 3 65 65 4 |
4. | 4 54 54 3 |
5. | 5 98 98 5 |
+------------------------+
A simple list or browse may confuse you because encode assigns the sequence of natural numbers but also assigns value labels equal to the original strings:
. list
+------------------------+
| id age age3 age2 |
|------------------------|
1. | 1 32 32 32 |
2. | 2 14 14 14 |
3. | 3 65 65 65 |
4. | 4 54 54 54 |
5. | 5 98 98 98 |
+------------------------+
The nolabel option shows the "underlying" data.
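Applied to the variables named in the question, the fix might look like the sketch below. The variable names agen, age, hhage and relntohrp come from the question; the four observations are made up purely for illustration.

```stata
* Sketch with made-up data; only the variable names come from the question
clear
input str3 agen relntohrp
"65" 1
"63" 2
"33" 1
"28" 3
end

* destring keeps the numeric content of the string, unlike encode
destring agen, gen(age)

* Age of the household head only; missing for everyone else
gen hhage = age if relntohrp == 1
list
```

Here hhage holds 65 and 33 for the two household heads, because age contains the actual ages rather than arbitrary integer codes.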
You mention it is inconsistent, but for future questions posting exact input and results is more useful for those trying to help you.
Try taking a look at this method? Sounds like you may have slipped up somewhere in your method.
I want to display (list) the value of a string variable DE15_WHY in Stata only when it is not missing (e.g. some subjects did not provide comments). I thought this would be easy:
list DE15_WHY if DE15_WHY != ""
This displays DE15_WHY for all subjects even if they do not have anything in DE15_WHY...
Is the string formatted wrongly? For example, does Stata think that all subjects have a valid observation for DE15_WHY? How do I fix this? I checked, and it is formatted as a string variable.
Stata also allows me to tabulate DE15_WHY, similar to R. This is a great option but does not display the entire contents of the string variable in the table. How do I get Stata to display the entire string?
@Metrics' answer has several good details, but I will here add more.
With string variables, Stata has only one definition of missing, namely that a string is empty, and contains precisely no characters.
One or more spaces, despite usually conveying nothing to people, do not qualify as missing so far as Stata is concerned.
The term "blank" is perhaps unclear here and thus better avoided.
If spaces somehow get into your string variables a condition such as
if trim(mystring) == ""
selects values that are empty or that have spaces and correspondingly a condition such as
if trim(mystring) != ""
selects values with other content. To replace spaces with empty strings, we thus go
replace mystring = "" if trim(mystring) == ""
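A small sketch of the difference between empty strings and strings of spaces follows. The variable name mystring is taken from the conditions above; the four values are invented for illustration.

```stata
* Sketch with invented values: one truly empty string, one of spaces only
clear
set obs 4
gen str8 mystring = ""
replace mystring = "ab" in 1
replace mystring = "   " in 3
replace mystring = "cd" in 4

* Spaces count as content, so this finds 3 observations
count if mystring != ""

* After trimming, only "ab" and "cd" qualify, so this finds 2
count if trim(mystring) != ""

* Turn space-only values into genuinely missing (empty) strings
replace mystring = "" if trim(mystring) == ""
count if missing(mystring)
```

After the final replace, a plain comparison with "" behaves as the question expected.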
In general, if you have rather long strings, Stata necessarily has a problem of where to display them. One tip is that list will show more than tabulate. If you want a tabulate and list hybrid, check out groups from SSC, using ssc inst groups.
Although the period . is the default or system missing value for numeric variables (or numeric scalars or matrix elements) in Stata, it does not attach any special meaning to the string ".".
sysuse auto
list rep78 in 1/10 if rep78 != . // non-missing values only
tab rep78 // by default, only non-missing values are reported
tab rep78, missing // include missing values as well
If variable is a string with missing indicated by .
list yourvariable if yourvariable !="."
If variable is a string with missing indicated by blank
list yourvariable if yourvariable !=""
Example:
my my1
ab 1
cd 2
3
ef 4
list my if my !=""
+----+
| my |
|----|
1. | ab |
2. | cd |
4. | ef |
+----+
tab will treat empty strings (for string variables) and . (for numeric variables) as missing.
tab my
my | Freq. Percent Cum.
------------+-----------------------------------
ab | 1 33.33 33.33
cd | 1 33.33 66.67
ef | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
tab my,missing
my | Freq. Percent Cum.
------------+-----------------------------------
| 1 25.00 25.00
ab | 1 25.00 50.00
cd | 1 25.00 75.00
ef | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00