I am using tabstat in Stata, and using estpost and esttab to get its output to LaTeX. I use tabstat to display statistics by group. For example,
tabstat assets, by(industry) missing statistics(count mean sd p25 p50 p75)
The question I have is whether there is a way for tabstat (or other Stata commands) to display the output ordered by the value of the mean, so that those categories that have higher means will be on top. By default, Stata displays by alphabetical order of industry when I use tabstat.
tabstat does not offer such a hook, but there is an approach to problems like this that is general and quite easy to understand.
You don't provide a reproducible example, so we need one:
. sysuse auto, clear
(1978 Automobile Data)
. gen Make = word(make, 1)
. tab Make if foreign
Make | Freq. Percent Cum.
------------+-----------------------------------
Audi | 2 9.09 9.09
BMW | 1 4.55 13.64
Datsun | 4 18.18 31.82
Fiat | 1 4.55 36.36
Honda | 2 9.09 45.45
Mazda | 1 4.55 50.00
Peugeot | 1 4.55 54.55
Renault | 1 4.55 59.09
Subaru | 1 4.55 63.64
Toyota | 3 13.64 77.27
VW | 4 18.18 95.45
Volvo | 1 4.55 100.00
------------+-----------------------------------
Total | 22 100.00
Make here is like your variable industry: it is a string variable, so in tables Stata will tend to show it in alphabetical (alphanumeric) order.
The work-around has several easy steps, some optional.
Calculate a variable on which you want to sort. egen is often useful here.
. egen mean_mpg = mean(mpg), by(Make)
Map those values to a variable with distinct integer values. As two groups could have the same mean (or other summary statistic), make sure you break ties on the original string variable.
. egen group = group(mean_mpg Make)
This variable is created to have value 1 for the group with the lowest mean (or other summary statistic), 2 for the next lowest, and so forth. If the opposite order is desired, as in this question, flip the grouping variable around.
. replace group = -group
(74 real changes made)
There is a problem with this new variable: the values of the original string variable, here Make, are nowhere to be seen. labmask (to be installed from the Stata Journal website after search labmask) is a helper here. We use the values of the original string variable as the value labels of the new variable. (The idea is that the value labels become the "mask" that the integer variable wears.)
. labmask group, values(Make)
Optionally, tidy up the variable label of the new integer variable.
. label var group "Make"
Now we can tabulate using the categories of the new variable.
. tabstat mpg if foreign, s(mean) by(group) format(%2.1f)
Summary for variables: mpg
by categories of: group (Make)
group | mean
--------+----------
Subaru | 35.0
Mazda | 30.0
VW | 28.5
Honda | 26.5
Renault | 26.0
Datsun | 25.8
BMW | 25.0
Toyota | 22.3
Fiat | 21.0
Audi | 20.0
Volvo | 17.0
Peugeot | 14.0
--------+----------
Total | 24.8
-------------------
Note: other strategies are sometimes better or as good here.
If you collapse your data to a new dataset, you can then sort it as you please.
graph bar and graph dot are good at displaying summary statistics over groups, and the sort order can be tuned directly.
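As a minimal sketch of the collapse strategy, using the same auto data and the Make variable from the example above (the names mean_mpg and n are just illustrative choices):

```stata
* Collapse to one observation per make, then sort by the mean
sysuse auto, clear
gen Make = word(make, 1)
keep if foreign
collapse (mean) mean_mpg = mpg (count) n = mpg, by(Make)
* gsort with a minus sign sorts in descending order; Make breaks ties
gsort -mean_mpg Make
list Make n mean_mpg, noobs
```

Note that collapse replaces the data in memory, so use preserve beforehand (or work on a copy) if you still need the original dataset.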
UPDATE 3 and 5 October 2021: A new helper command, myaxis, from SSC and the Stata Journal (see the accompanying paper) condenses the example here with tabstat:
* set up data example
sysuse auto, clear
gen Make = word(make, 1)
* sort order variable and tabulation
myaxis Make2 = Make, sort(mean mpg) descending
tabstat mpg if foreign, s(mean) by(Make2) format(%2.1f)
I would look at the egenmore package on SSC. You can get that package by typing in Stata ssc install egenmore. In particular, I would look at the entry for axis() in the helpfile of egenmore. That contains an example that does exactly what you want.
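A rough sketch of what that might look like with the auto data is below. It assumes egenmore is installed, and the option names label() and reverse are written from memory of the help file, so please check help egenmore before relying on them. The variable names Make, mean_mpg and Make_order are illustrative.

```stata
* Assumes: ssc install egenmore
sysuse auto, clear
gen Make = word(make, 1)
egen mean_mpg = mean(mpg), by(Make)
* axis() maps the sort order to integers and attaches labels from Make;
* reverse is assumed here to put the highest mean first
egen Make_order = axis(mean_mpg Make), label(Make) reverse
tabstat mpg if foreign, s(mean) by(Make_order) format(%2.1f)
```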
Related
I am trying to convert a column in Stata (AP1) in the photo below so that the entries are string types. Currently the entries appear as characters ("cultivateur" for example) but are shown as being of the type int.
I used the following code to try and change them to strings.
label values AP1 .
tostring AP1, replace
AP1 was long now str5
While the AP1 column becomes a string type all the characters now become integers which is not what I need in order to subset the data based on observations.
Does anyone know how I can switch this column to the string type without the characters becoming integers?
Your images are barely readable -- on a phone or laptop -- but perhaps readable by anyone using a very large monitor. Please see the Stata tag wiki for guidance on presenting reproducible data examples.
What is going on can be explained reproducibly by considering
. sysuse auto, clear
(1978 automobile data)
. tab foreign
Car origin | Freq. Percent Cum.
------------+-----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. tab foreign, nolabel
Car origin | Freq. Percent Cum.
------------+-----------------------------------
0 | 52 70.27 70.27
1 | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. tostring foreign, gen(str_foreign)
str_foreign generated as str1
. tab str_foreign
Car origin | Freq. Percent Cum.
------------+-----------------------------------
0 | 52 70.27 70.27
1 | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
. d *foreign
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------------------------------
foreign byte %8.0g origin Car origin
str_foreign str1 %9s Car origin
foreign like your problematic variable is a numeric variable with value labels. (The term "column" is not standard in Stata for variables in the dataset.) Push it through tostring and you get a string variable containing integer characters. Stata did what you asked.
To get a string variable containing the text of the value labels, you need to apply the decode command, which was written for precisely this purpose (and incidentally, long predates tostring as an official command).
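A minimal sketch with the same auto data follows; the new variable name origin_text is just an illustrative choice.

```stata
sysuse auto, clear
* decode copies the text of the value labels into a new string variable
decode foreign, gen(origin_text)
describe foreign origin_text
tab origin_text
* Subsetting on the label text now works directly
count if origin_text == "Foreign"
* Alternative without a new variable: refer to a value by its label text
count if foreign == "Foreign":origin
```

The last line shows that you may not need a string variable at all just to subset: the "text":labelname syntax looks up the numeric value behind a value label.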
I have 154,901 rows of data that look like this:
Text String | 340
Where "Text String" represents a variable string that has no other pattern or order to it and cannot be predicted in any mathematical way, and 340 represents a random integer. How can I find the sum of all of the values sharing an identical string, and organize this data based on total per unique string?
For example, say I have the dataset
Alpha | 3
Alpha | 6
Beta | 4
Gamma | 1
Gamma | 3
Gamma | 8
Omega | 10
I'm looking for some way to present the data as:
Alpha | 9
Beta | 4
Gamma | 12
Omega | 10
The point of this being that I have a dataset so large that I cannot enumerate this manually, and I have a finite yet unknown amount of strings that I cannot reliably predict what they are.
Consider using a pivot table, and then aggregate the numbers by string. This is probably the least ugly option. – Tim Biegeleisen
I have a table that more or less looks like this:
Team_Name | Total_Errors | Total_Volume
_______________________________________
Sam | 3 | 1350
Sam | 5 | 1100
Jamie | 7 | 1600
Mark | 3 | 1220
Jamie | 10 | 2100
Mark | 5 | 1300
Sam | 5 | 1100
Jamie | 3 | 1900
Just with a lot more rows. I want to create a formula that calculates the average Total_Errors for just the rows corresponding to Team_Name "Jamie" and "Sam".
How do I do this?
Something like Average(If(June(Team_Name)="Jamie","Sam"......?
(the table name is June)
thanks in advance
You can use Sum/Count:
=(SUMIF(A1:A8,"Jamie",B1:B8)+SUMIF(A1:A8,"Sam",B1:B8))/(COUNTIF(A1:A8,"Jamie")+COUNTIF(A1:A8,"Sam"))
I would go with a simple pivot table that uses June as a data source.
Put your Team_Name field in Rows and Total_Errors in Values. Change the field settings of Total_Errors to Average, and set how many decimal places you want to see.
You can then apply whatever filters /Slicers you want and get your desired result.
Here's a screenshot (it's on a Mac, but you'll get the idea)
Assuming the data is located at A1:C9, enter this formula at F5. Note that the criteria range used by the formula is located at E2:E4 (see picture below):
=DAVERAGE($A$1:$C$9,$B$1,$E$2:$E$4)
Slightly awkward requirements, so I apologise if the explanation isn't overly clear.
I have two tables, with very similar data (though not identical), which I'd like to merge together and total up as follows.
Both Tables Contain the following headings
Invoice, Date, Account, No., Description, Blank, Credit, Debit, Total
However, they are for slightly different things (support and commission to be exact). Both tables contain multiple rows of data for various customers, but some customers may only be in one table or the other.
I've used pivot tables for each table individually to show the sum totals for each customer (so I have a table of every customers total support value, and a separate table for every customers total commission). Similarly to above though, customers may be in one pivot table but not the other.
What I would like is a single table to show every customer from both tables (if they are in both tables, I only want one record), with the total support (showing 0 if the customer isn't in the table), the total commission (again, 0 if the customer isn't in that table), and ideally the total overall (although this is a simple sum of the other two, so it can be added afterwards if required).
As an example, if the relevant columns in two tables were;
Support                  Commission
Account | Total          Account | Total
-----------------        -----------------
A       |  25.00         A       |   5.00
A       |  25.00         C       | -10.00
A       |  45.00         C       |  10.00
B       |  10.00         C       |  30.00
B       |  -5.00         C       |  25.00
C       |   5.00         D       |  25.00
C       |  10.00         D       |  -5.00
C       |  10.00         E       |  15.00
E       |  25.00
I'm trying to end up with a table that looks like;
Account | Support Total | Commission Total | Overall Total
----------------------------------------------------------------
A | 95.00 | 5.00 | 100.00
B | 5.00 | 0.00 | 5.00
C | 25.00 | 55.00 | 80.00
D | 0.00 | 20.00 | 20.00
E | 25.00 | 15.00 | 50.00
This isn't something I'd want to do manually, as my actual tables have 2000+ rows in them.
Any help would be greatly appreciated. (I've been messing around with various Excel features for a long time now and I've run out of ideas)
Use multiple consolidation ranges (e.g. further details here - but you can stop short of creating the Table).
Ensure your separate sources have the same column labels:
N.B. 25+15 = 40 :)
I want to display (list) the value of a string variable DE15_WHY in Stata only when it is not missing (e.g. some subjects did not provide comments). I thought this would be easy:
list DE15_WHY if DE15_WHY != ""
This displays DE15_WHY for all subjects even if they do not have anything in DE15_WHY...
Is the string formatted wrongly? For example, does Stata think that all subjects have a valid observation for DE15_WHY? How do I fix this? I checked, and it is formatted as a string variable.
Stata also allows me to tabulate DE15_WHY, similar to R. This is a great option but does not display the entire contents of the string variable in the table. How do I get Stata to display the entire string?
@Metrics' answer has several good details, but I will add more here.
With string variables, Stata has only one definition of missing, namely that a string is empty, and contains precisely no characters.
One or more spaces, despite usually conveying nothing to people, do not qualify as missing so far as Stata is concerned.
The term "blank" is perhaps unclear here and thus better avoided.
If spaces somehow get into your string variables a condition such as
if trim(mystring) == ""
selects values that are empty or that have spaces and correspondingly a condition such as
if trim(mystring) != ""
selects values with other content. To replace spaces with empty strings, we thus go
replace mystring = "" if trim(mystring) == ""
In general, if you have rather long strings, Stata necessarily has a problem of where to display them. One tip is that list will show more than tabulate. If you want a tabulate and list hybrid, check out groups from SSC, using ssc inst groups.
Although the period . is the default or system missing value for numeric variables (or numeric scalars or matrix elements) in Stata, it does not attach any special meaning to the string ".".
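To make these points concrete, here is a small sketch on a made-up variable (mystring and its four values are invented for illustration): one ordinary value, one empty string, one string of spaces, and one string ".".

```stata
clear
set obs 4
* observation 2 is left as the empty string
gen str5 mystring = ""
replace mystring = "ab" in 1
replace mystring = "   " in 3
replace mystring = "." in 4

* counts the spaces-only value as non-missing
count if mystring != ""
* trim() removes leading and trailing spaces first
count if trim(mystring) != ""
* turn spaces-only values into genuinely missing (empty) strings
replace mystring = "" if trim(mystring) == ""
count if missing(mystring)
* the string "." is still an ordinary, non-missing value
list if mystring == "."
```

The first count includes the spaces-only value and the second does not; after the replace, only the two empty strings are missing, while "." remains an ordinary string.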
sysuse auto
* list non-missing values only
list rep78 in 1/10 if rep78 != .
* by default, tabulate reports only non-missing values
tab rep78
* add the missing option to show missing values as well
tab rep78, missing
If the variable is a string with missing indicated by ".":
list yourvariable if yourvariable != "."
If the variable is a string with missing indicated by an empty string:
list yourvariable if yourvariable != ""
Example:
my   my1
ab   1
cd   2
     3
ef   4
list my if my !=""
+----+
| my |
|----|
1. | ab |
2. | cd |
4. | ef |
+----+
By default, tab omits empty strings because they are missing; a string "." is an ordinary value and is tabulated like any other.
tab my
my | Freq. Percent Cum.
------------+-----------------------------------
ab | 1 33.33 33.33
cd | 1 33.33 66.67
ef | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
tab my,missing
my | Freq. Percent Cum.
------------+-----------------------------------
| 1 25.00 25.00
ab | 1 25.00 50.00
cd | 1 25.00 75.00
ef | 1 25.00 100.00
------------+-----------------------------------
Total | 4 100.00