Listing nonmissing string variables in Stata

I want to display (list) the value of a string variable DE15_WHY in Stata only when it is not missing (e.g. some subjects did not provide comments). I thought this would be easy:
list DE15_WHY if DE15_WHY != ""
This displays DE15_WHY for all subjects even if they do not have anything in DE15_WHY...
Is the string formatted wrongly? For example, does Stata think that all subjects have a valid observation for DE15_WHY? How do I fix this? I checked, and it is formatted as a string variable.
Stata also allows me to tabulate DE15_WHY, similar to R. This is a great option but does not display the entire contents of the string variable in the table. How do I get Stata to display the entire string?

@Metrics' answer has several good details, but I will add more here.
With string variables, Stata has only one definition of missing, namely that a string is empty, and contains precisely no characters.
One or more spaces, despite usually conveying nothing to people, do not qualify as missing so far as Stata is concerned.
The term "blank" is perhaps unclear here and thus better avoided.
If spaces somehow get into your string variables, a condition such as
if trim(mystring) == ""
selects values that are empty or that consist only of spaces, and correspondingly a condition such as
if trim(mystring) != ""
selects values with other content. To replace values that consist only of spaces with empty strings, we thus go
replace mystring = "" if trim(mystring) == ""
In general, if you have rather long strings, Stata necessarily has a problem of where to display them. One tip is that list will show more than tabulate. If you want a tabulate and list hybrid, check out groups from SSC, using ssc inst groups.
Although the period . is the default or system missing value for numeric variables (or numeric scalars or matrix elements), Stata does not attach any special meaning to the string ".".

sysuse auto
* list non-missing values only
list rep78 in 1/10 if rep78 != .
* by default, tab reports only non-missing values
tab rep78
* add the missing option to include missing values as well
tab rep78, missing
If the variable is a string with missing values coded as ".":
list yourvariable if yourvariable !="."
If the variable is a string with missing values coded as empty strings:
list yourvariable if yourvariable !=""
Example:
my    my1
ab    1
cd    2
      3
ef    4
list my if my !=""
     +----+
     | my |
     |----|
  1. | ab |
  2. | cd |
  4. | ef |
     +----+
By default, tab omits missing values: the empty string for string variables and . for numeric variables.
tab my
         my |      Freq.     Percent        Cum.
------------+-----------------------------------
         ab |          1       33.33       33.33
         cd |          1       33.33       66.67
         ef |          1       33.33      100.00
------------+-----------------------------------
      Total |          3      100.00
tab my, missing
         my |      Freq.     Percent        Cum.
------------+-----------------------------------
            |          1       25.00       25.00
         ab |          1       25.00       50.00
         cd |          1       25.00       75.00
         ef |          1       25.00      100.00
------------+-----------------------------------
      Total |          4      100.00

Related

tostring turns character values into integers

I am trying to convert a column in Stata (AP1) in the photo below so that the entries are string types. Currently the entries appear as characters ("cultivateur" for example) but are shown as being of the type int.
I used the following code to try and change them to strings.
label values AP1 .
tostring AP1, replace
AP1 was long now str5
While the AP1 column becomes a string type all the characters now become integers which is not what I need in order to subset the data based on observations.
Does anyone know how I can switch this column to the string type without the characters becoming integers?
Your images are barely readable -- on a phone or laptop -- but perhaps readable by anyone using a very large monitor. Please see the Stata tag wiki for guidance on presenting reproducible data examples.
What is going on can be explained reproducibly by considering
. sysuse auto, clear
(1978 automobile data)
. tab foreign
 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
. tab foreign, nolabel
 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
. tostring foreign, gen(str_foreign)
str_foreign generated as str1
. tab str_foreign
 Car origin |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         52       70.27       70.27
          1 |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00
. d *foreign
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------------------------
foreign         byte    %8.0g      origin     Car origin
str_foreign     str1    %9s                   Car origin
foreign, like your problematic variable, is a numeric variable with value labels. (The term "column" is not standard in Stata for variables in the dataset.) Push it through tostring and you get a string variable containing integer characters. Stata did what you asked.
To get a string variable containing the text of the value labels, you need to apply the decode command, which was written for precisely this purpose (and incidentally, long predates tostring as an official command).

Add "invisible" decimal places to end of number?

I am printing a "Table" to the console. I will be using this same table structure for several different variables. However as you can see from Output below, the lines don't all align.
One way to resolve it would be to increase the number of decimal places (e.g. 6.730000 for Standard Deviation) which would push the line into place.
However, I do not want this many decimal places.
Is it possible to add extra 0s to the end of a number, and make these invisible?
I am planning on using this table structure for several variables, and the length of Mean, Stddev, and Median will likely never be more than 6 characters.
EDIT - I would really like to ensure that each value which appears in the table will be 6 characters long, and if it is not 6 characters long, add additional "invisible" zeros.
Input
# Create and structure Table to store descriptive statistics for each variable.
subtitle = "| Mean | Stddev | Median |"
structure = '| {:0.2f} | {:0.2f} | {:0.2f} |'
lines = '=' * len(subtitle)
# Print table.
print(lines)
print(subtitle)
print(lines)
print(structure.format(mean, std, median))
print(lines)
Output:
======================================
| Mean | Stddev | Median |
======================================
| 181.26 | 6.73 | 180.34 |
======================================
Didn't really figure this out - but found a workaround.
I just did the following:
"| {:^6} | {:^6} | {:^6} | {:^6} | {:^6} |"
This keeps the width between | consistent.
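Note that {:^6} on its own fixes the width but drops the two-decimal formatting from the question. A minimal sketch of combining the two in one format spec follows; the numeric values are placeholders taken from the question's sample output, and the header row is rebuilt with the same width as an assumption about the intended layout.

```python
# Placeholder values standing in for the question's mean, std and median.
mean, std, median = 181.26, 6.73, 180.34

# Each numeric cell: centred (^), minimum width 6, two decimal places.
structure = "| {:^6.2f} | {:^6.2f} | {:^6.2f} |"

# Build the header with the same cell width so the separators line up.
subtitle = "| {:^6} | {:^6} | {:^6} |".format("Mean", "Stddev", "Median")

row = structure.format(mean, std, median)
lines = "=" * len(subtitle)

print(lines)
print(subtitle)
print(lines)
print(row)
print(lines)
```

Shorter values such as 6.73 are padded with spaces rather than zeros, so every line has the same length. Values longer than six characters would still widen their cell, since the width in a format spec is a minimum, not a maximum.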

Excel: Sum cells if they share an identical unknown string

I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point is that my dataset is too large to enumerate manually, and it contains a finite yet unknown number of strings whose values I cannot reliably predict.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen

Excel SUMIF when another cell contains text

So for example purposes, I have the following table:
| | A | B |
| |------------|----------|
| 1 |Description |Amount |
| 2 |------------|----------|
| 3 |Item1 | 5.00|
| 4 |Item2** | 29.00|
| 5 |Item3 | 1.00|
| 6 |Item4** | 5.00|
| 7 |------------|----------|
| 8 |Star Total | 34.00|
| 9 |------------|----------|
I want to create a formula in B8 that calculates the sum of the amounts if the description of that amount contains "**" (or some other denoting text). In this particular example I would like a formula that returns 34 since only Item2 and Item4 contain "**".
I tried to use something like this, but it only worked based on the value in A3:
=SUMIF(A3:A6, ISNUMBER(SEARCH("**", A3)), B3:B6)
Any suggestions would be appreciated!
The asterisk is the wildcard symbol that can be used in Sumif(), so you may want to change the denoting text to some other symbols, for example ##. Then this formula will work:
=SUMIF(A2:A10,"*##*",B2:B10)
If you want to keep the asterisks, the formula gets a bit curlier.
=SUMIF(A2:A10,"*~*~**",B2:B10)
The two middle asterisks are escaped with the tilde character.
You can escape the wildcard character and turn it into a literal * by prefixing it with a swung dash (tilde, ~) and so leave your data unchanged:
=SUMIF(A2:A7,"*~*~*",B2:B7)
IMO worthwhile because asterisks are relatively 'elegant'.

Stata tabstat change order/sort?

I am using tabstat in Stata, and using estpost and esttab to get its output to LaTeX. I use tabstat to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that those categories that have higher means will be on top. By default, Stata displays by alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
       Make |      Freq.     Percent        Cum.
------------+-----------------------------------
       Audi |          2        9.09        9.09
        BMW |          1        4.55       13.64
     Datsun |          4       18.18       31.82
       Fiat |          1        4.55       36.36
      Honda |          2        9.09       45.45
      Mazda |          1        4.55       50.00
    Peugeot |          1        4.55       54.55
    Renault |          1        4.55       59.09
     Subaru |          1        4.55       63.64
     Toyota |          3       13.64       77.27
         VW |          4       18.18       95.45
      Volvo |          1        4.55      100.00
------------+-----------------------------------
      Total |         22      100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, work at the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
     by categories of: group (Make)

  group |      mean
--------+----------
 Subaru |      35.0
  Mazda |      30.0
     VW |      28.5
  Honda |      26.5
Renault |      26.0
 Datsun |      25.8
    BMW |      25.0
 Toyota |      22.3
   Fiat |      21.0
   Audi |      20.0
  Volvo |      17.0
Peugeot |      14.0
--------+----------
  Total |      24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
UPDATE 3 and 5 October 2021: A new helper command myaxis from SSC and the Stata Journal (see the accompanying paper) condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing in Stata ssc install egenmore. In particular, I would look at the entry for axis() in the helpfile of egenmore. That contains an example that does exactly what you want.

Resources