Exporting results - statistics

I'm sure this is an issue anyone who uses Stata for publications or reports has run into:
How do you conveniently export your output to something that can be parsed by a scripting language or Excel?
There are a few ado files that do this for specific commands. For example:
findit tabout
findit outreg2
But what about exporting the output of the table command? Or the results of an anova?
I would love to hear about how Stata users address this problem for either specific commands or in general.

After experimenting with this for a while, I've found a solution that works for me.
There are a variety of ADOs that handle exporting specific functions. I've made use of outreg2 for regressions and tabout for summary statistics.
For simpler commands, it's easy to write your own programs to save results automatically to plain text in a standard format. Here are a few I wrote. Note that these both display results (to be saved to a log file) and export them into text files. If you wanted to just save to text, you could get rid of the di's and qui the sum, tab, etc. commands:
cap program drop sumout
program define sumout
di ""
di ""
di "Summary of `1'"
di ""
sum `1', d
qui matrix X = (r(mean), r(sd), r(p50), r(min), r(max))
qui matrix colnames X = mean sd median min max
qui mat2txt, matrix(X) saving("`2'") replace
end
cap program drop tab2_chi_out
program define tab2_chi_out
di ""
di ""
di "Tabulation of `1' and `2'"
di ""
tab `1' `2', chi2
qui matrix X = (r(p), r(chi2))
qui matrix colnames X = chi2p chi2
qui mat2txt, matrix(X) saving("`3'") replace
end
cap program drop oneway_out
program define oneway_out
di ""
di ""
di "Oneway anova with dv = `1' and iv = `2'"
di ""
oneway `1' `2'
qui matrix X = (r(F), r(df_r), r(df_m), Ftail(r(df_m), r(df_r), r(F)))
qui matrix colnames X = anova_between_groups_F within_groups_df between_groups_df P
qui mat2txt, matrix(X) saving("`3'") replace
end
cap program drop anova_out
program define anova_out
di ""
di ""
di "Anova command: anova `1'"
di ""
anova `1'
qui matrix X = (e(F), e(df_r), e(df_m), Ftail(e(df_m), e(df_r), e(F)), e(r2_a))
qui matrix colnames X = anova_between_groups_F within_groups_df between_groups_df P RsquaredAdj
qui mat2txt, matrix(X) saving("`2'") replace
end
The question is then how to get the output into Excel and format it. I found that the best way to import the text output files from Stata into Excel is to concatenate them into one big text file and then import that single file using the Import Text File... feature in Excel.
I concatenate the files by placing this Ruby code in the output folder and then running it from my do-file with qui shell cd path/to/output/folder/ && ruby table.rb:
output = ""
Dir.new(".").entries.each do |file|
next if file =~/\A\./ || file == "table.rb" || file == "out.txt"
if file =~ /.*xml/
system "rm #{file}"
next
end
contents = File.open(file, "rb").read
output << "\n\n#{file}\n\n" << contents
end
File.open("out.txt", 'w') {|f| f.write(output)}
Once I import out.txt into its own sheet in Excel, I use a bunch of Excel's built-in functions to pull the data together into nice, pretty tables.
I use a combination of vlookup, offset, match, iferror, and hidden columns with cell numbers and filenames to do this. The source .txt file is included in out.txt just above the contents of that file, which lets you look up the contents of the file using these functions and then reference specific cells using vlookup and offset.
This Excel business is actually the most complicated part of this system and there's really no good way to explain it without showing you the file, though hopefully you can get enough of an idea to figure it out for yourself. If not, feel free to contact me through http://maxmasnick.com and I can get you more info.
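For readers who would rather post-process out.txt in a script than in Excel, the same "filename line above the file's contents" layout can be split back into per-file blocks. This is only a sketch in Python, not part of the workflow described above; split_sections and the sample filenames are hypothetical, and it assumes no data line is identical to one of the filenames.

```python
def split_sections(combined, filenames):
    """Split the concatenated text back into {filename: [non-blank lines]}.

    Assumes each file's contents follow a line holding only its filename,
    as written by the Ruby script above.
    """
    names = set(filenames)
    sections = {}
    current = None
    for line in combined.splitlines():
        stripped = line.strip()
        if stripped in names:
            # A header line: start collecting a new block.
            current = stripped
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(line)
    return sections


# Hypothetical sample mimicking two small exported matrices.
sample = (
    "\n\nsum_age.txt\n\n\tmean\tsd\nr1\t40\t5\n"
    "\n\nanova.txt\n\n\tF\tP\nr1\t3.2\t0.04\n"
)
blocks = split_sections(sample, ["sum_age.txt", "anova.txt"])
```

Each block can then be split on tabs to reach individual cells, which plays the role that vlookup and offset play in the spreadsheet.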

I have found that the estout package is the most developed and has good documentation.

This is an old question and a lot has happened since it was posted.
Stata now has several built-in commands and functions that allow anyone to
export customized output fairly easily:
putexcel
putexcel with advanced syntax
putdocx
putpdf
There are also equivalent Mata functions / classes, which offer greater flexibility:
_docx*()
Pdf*()
xl()
From my experience, there aren't 100% general solutions. Community-contributed commands such as estout are now mature enough to handle most basic operations. That said, if you have something that deviates even slightly from the template you will have to program this yourself.

Most tutorials throw in several packages where it would indeed be nice to have only one exporting everything, which is what Max suggests above with his interesting method.
I personally use tabout for summary statistics and frequencies, estout for regression output, and am trying out mkcorr for correlation matrices.

It's been a while, but I believe you can issue a log command to capture the output.
log using c:\data\anova_analysis.log, text
[commands]
log close

I use estpost, part of the estout package, to tabulate results from non-estimation commands. You can then store them and export easily.
Here's an example:
estpost corr varA varB varC varD, matrix
est store corrs
esttab corrs using corrs.rtf, replace
You can then add options to change formatting, etc.

You can use asdoc that is available on SSC. To download,
ssc install asdoc
asdoc works well with almost all Stata commands. Specifically, it produces publication-quality tables for:
summarize command - to report summary statistics
cor or pwcorr command - to report correlations
tabstat - for flexible tables of descriptive statistics
tabulate - for one-way, two-way, three-way tabulations
regress - for detailed, nested, and wide regression tables
table - flexible tables
and many more. You can explore more about asdoc here
https://fintechprofessor.com/2018/01/31/asdoc/

Related

how do I get rid of leading/trailing spaces in SAS search terms?

I have had to look up hundreds (if not thousands) of free-text answers on Google, making notes in Excel along the way and inserting SAS code around the answers as a last step.
The output looks like this:
This output contains an unnecessary number of blank spaces, which seems to confuse SAS's search to the point where the observations can't be properly located.
It works if I manually erase superfluous spaces, but that will probably take hours. Is there an automated fix for this, either in SAS or in Excel?
I tried using the STRIP function, to no avail:
else if R_res_ort_txt=strip(" arild ") and R_kom_lan=strip(" skåne ") then R_kommun=strip(" Höganäs " );
If you want to generate a string like:
if R_res_ort_txt="arild" and R_kom_lan="skåne" then R_kommun="Höganäs";
from three variables, let's call them A B C, then just use code like:
string=catx(' ','if R_res_ort_txt=',quote(trim(A))
,'and R_kom_lan=',quote(trim(B))
,'then R_kommun=',quote(trim(C)),';') ;
Or if you are just writing that string to a file just use this PUT statement syntax.
put 'if R_res_ort_txt=' A :$quote. 'and R_kom_lan=' B :$quote.
'then R_kommun=' C :$quote. ';' ;
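If the statements are generated outside SAS (for example from the exported spreadsheet rows), the same idea of trimming first and quoting second can be sketched in Python. This is an illustration only: sas_quote and make_rule are hypothetical helper names, and the sketch strips both leading and trailing blanks and doubles any embedded double quotes.

```python
def sas_quote(value):
    """Strip surrounding blanks, then wrap in double quotes.

    Embedded double quotes are doubled so the result stays a valid
    SAS string literal.
    """
    trimmed = value.strip()
    return '"' + trimmed.replace('"', '""') + '"'


def make_rule(res_ort, kom_lan, kommun, first=False):
    """Build one if / else-if line from three free-text values."""
    keyword = "if" if first else "else if"
    return (
        f"{keyword} R_res_ort_txt={sas_quote(res_ort)} "
        f"and R_kom_lan={sas_quote(kom_lan)} "
        f"then R_kommun={sas_quote(kommun)};"
    )


rule = make_rule(" arild ", " skåne ", " Höganäs ", first=True)
print(rule)
```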
A saner solution would be to continue using the free-text answers as data and perform your matching criteria for transformations with a left join.
proc import out=answers datafile='my-free-text-answers.xlsx';
data have;
attrib R_res_ort_txt R_kom_lan length=$100;
input R_res_ort_txt ...;
datalines4;
... whatever all those transforms will be performed on...
;;;;
proc sql;
create table want as
select
have.* ,
answers.R_kommun_answer as R_kommun
from
have
left join
answers
on
have.R_res_ort_txt = answers.res_ort_answer
& have.R_kom_lan = answers.kom_lan_answer
;
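The logic of that left join, where every row of have is kept and R_kommun is filled in only when both keys match, can be sketched outside SAS as follows. This is a rough Python illustration with made-up rows; left_join is a hypothetical name, and unlike SQL it keeps only one answer per key pair if the answers table contains duplicates.

```python
def left_join(have, answers):
    """Attach R_kommun to each row of `have`, or None when nothing matches."""
    lookup = {
        (a["res_ort_answer"], a["kom_lan_answer"]): a["R_kommun_answer"]
        for a in answers
    }
    return [
        dict(row, R_kommun=lookup.get((row["R_res_ort_txt"], row["R_kom_lan"])))
        for row in have
    ]


# Made-up example rows, for illustration only.
answers = [
    {"res_ort_answer": "arild", "kom_lan_answer": "skåne",
     "R_kommun_answer": "Höganäs"},
]
have = [
    {"R_res_ort_txt": "arild", "R_kom_lan": "skåne"},
    {"R_res_ort_txt": "okänd", "R_kom_lan": "skåne"},
]
want = left_join(have, answers)
```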
I solved this by adding quotes in excel using the flash fill function:
https://www.youtube.com/watch?v=nE65QeDoepc

R DiagrammeR package Mermaid text using actual calculation results

I would like to utilize the DiagrammeR package for a simple flow chart in my Rmarkdown. However, I couldn't figure out a way to use actual output from a data table into the text. Suppose I have a simple query of a database with total records, patients count and date in year info for three different cohorts.
I wanted to create a diagram using Mermaid. The code looks like this:
Total = paste0('Records:',b1$records,' Patients:',b1$patients,' Year:',b1$year)
# (Records:1000 Patients:822 Year:5)
Sub1 = paste0('Records:',b2$records,' Patients:',b2$patients,' Year:',b2$year)
Sub2 = paste0('Records:',b3$records,' Patients:',b3$patients,' Year:',b3$year)
mermaid("
graph TB
A[Total] --> B{Sub1} --> C{Sub2}
")
Instead of printing the diagram with Records:1000 Patients:822 Year:5 in node A, it shows the literal word "Total".
Any suggestions on how to do it correctly?
Thanks!
You are one step away from what you'd like to achieve. Please try this simple example below to see the logic:
library(DiagrammeR)
Structure:
DiagrammeR(
"
graph TB
A[Question] -->B[Answer]
"
)
1. Define answer node:
B <- paste0("There are ", nrow(iris), " records")
2. Combine it with other components, using ; to separate statements:
results <- paste0("graph TB; A[How many rows does iris have?]-->", "B[", B, "]")
3. Call 'results' in DiagrammeR:
DiagrammeR(diagram = results)
The final plot should refresh when your calculation updates.
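The underlying point is language-independent: the Mermaid specification is an ordinary string, so computed values have to be pasted into it before it is rendered. A minimal sketch of that string assembly in Python, using the three-node layout from the question; cohort_label, build_diagram and the cohort numbers other than the first are invented for illustration.

```python
def cohort_label(cohort):
    """Format one cohort's counts as the text shown inside a node."""
    return (
        f"Records:{cohort['records']} "
        f"Patients:{cohort['patients']} "
        f"Year:{cohort['year']}"
    )


def build_diagram(total, sub1, sub2):
    """Return a Mermaid spec with the labels substituted into the nodes."""
    return (
        "graph TB; "
        f"A[{cohort_label(total)}] --> B{{{cohort_label(sub1)}}}; "
        f"B --> C{{{cohort_label(sub2)}}}"
    )


spec = build_diagram(
    {"records": 1000, "patients": 822, "year": 5},
    {"records": 500, "patients": 400, "year": 3},   # invented numbers
    {"records": 100, "patients": 90, "year": 1},    # invented numbers
)
print(spec)
```

The resulting string is what would be handed to the renderer, in the same way that results is passed to DiagrammeR above.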

How to extract relationships from a text

I am new to NLP and need guidance on how I can solve this problem.
I am currently doing a filtering technique where I need to brand data in a database as either being correct or incorrect. I am given a structured data set, with columns and rows.
However, the filtering conditions are given to me in a text file.
An example filtering text file could be the following:
Values in the column ID which are bigger than 99
Values in the column Cash which are smaller than 10000
Values in the column EndDate that are smaller than values in StartDate
Values in the column Name that contain numeric characters
Any value that follows those conditions should be branded as bad.
However, I want to extract those conditions and append them to the program that I've made so far.
For instance, for the conditions above, I would like to produce
`if ID>99`
`if Cash<10000`
`if EndDate < StartDate`
`if Name LIKE %[0-9]%`
How can I achieve the above result using the Stanford NLP? (or any other NLP library).
This doesn't look like a machine learning problem; it's a simple parser. You have a simple syntax, from which you can easily extract the salient features:
column name
relationship
target value or target column
The resulting "action rule" is simply removing the "syntactic sugar" words and converting the relationship -- and possibly the target value -- to its symbolic form.
Enumerate all of your critical words for each position in a lexicon. Then use basic string manipulation operators in your chosen implementation language to find the three needed fields.
EXAMPLE
Given the data above, your lexicons might be like this:
column_trigger = "Values in the column"
relation_dict = {
"are bigger than" : ">",
"are smaller than" : "<",
"contain" : "LIKE",
...
}
value_desc = {
"numeric characters" : "%[0-9]%",
...
}
From here, use these items in standard parsing. If you're not familiar with that, please look up the basics of a simple sentence grammar in your favourite programming language, with rules such as
SENTENCE => SUBJ VERB OBJ
Does that get you going?
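To make the lexicon idea concrete, here is a minimal sketch in Python that handles only the four example sentences from the question. The names parse_rule, FILLER_WORDS and TARGET_COLUMN_PREFIX are hypothetical, the digit pattern is written as 0-9, and anything outside this tiny grammar raises an error rather than being guessed at.

```python
COLUMN_TRIGGER = "Values in the column "
RELATIONS = {
    "are bigger than": ">",
    "are smaller than": "<",
    "contain": "LIKE",
}
VALUE_DESCRIPTIONS = {"numeric characters": "%[0-9]%"}
FILLER_WORDS = ("which ", "that ")       # syntactic sugar to drop
TARGET_COLUMN_PREFIX = "values in "      # marks a column-to-column comparison


def parse_rule(sentence):
    """Turn one filter sentence into an 'if COLUMN OP TARGET' string."""
    sentence = sentence.strip()
    if not sentence.startswith(COLUMN_TRIGGER):
        raise ValueError(f"unrecognised sentence: {sentence!r}")
    rest = sentence[len(COLUMN_TRIGGER):]
    column, _, rest = rest.partition(" ")
    for filler in FILLER_WORDS:
        if rest.startswith(filler):
            rest = rest[len(filler):]
            break
    for phrase, symbol in RELATIONS.items():
        if rest.startswith(phrase + " "):
            target = rest[len(phrase) + 1:].strip()
            break
    else:
        raise ValueError(f"unknown relationship in: {sentence!r}")
    if target.startswith(TARGET_COLUMN_PREFIX):
        target = target[len(TARGET_COLUMN_PREFIX):]
    # Map descriptive targets to their symbolic form; keep literals as-is.
    target = VALUE_DESCRIPTIONS.get(target, target)
    return f"if {column} {symbol} {target}"


rules = [
    parse_rule("Values in the column ID which are bigger than 99"),
    parse_rule("Values in the column EndDate that are smaller than values in StartDate"),
    parse_rule("Values in the column Name that contain numeric characters"),
]
```

Extending the sketch means adding entries to the dictionaries, which is the main attraction of the lexicon approach over a trained model for input this regular.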

Extracting data from filenames in Gnuplot

Is there a way for Gnuplot to read and recognize structured strings? Specifically, I have a few hundred files, all containing measurement data, with measurement conditions defined in the filename.
My files look something like "100d5mK2d0T.txt", which would mean that this data was acquired at 100.5mK temperature and 2.0T magnetic field.
Any chance I could extract the temperature and field strength data from the name, and use them as labels in the plot?
Thanks in advance.
With gnuplot's internal string processing you could come up with a solution (using substr and strstrt), but that's quite verbose.
It's better to use an external tool for the string processing, like perl:
filename = '100d5mK2d0T.txt'
str = system('echo "'.filename. '" | perl -pe ''s/(\d+)d(\d+)mK(\d+)d(\d+)T\.txt/\1.\2 \3.\4/'' ')
temperature = word(str, 1)
magnetic_field = word(str, 2)
set label at graph 0.1,0.9 "Temperature: ".temperature." mK"
set label at graph 0.1,0.8 "Magnetic field: ".magnetic_field." T"
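If the plotting is driven from a script anyway, the same extraction can be done before gnuplot is called. A sketch in Python, assuming every filename follows the pattern from the question exactly (digits, "d" as the decimal separator, then the unit); parse_conditions is a hypothetical name.

```python
import re

# <int>d<frac>mK<int>d<frac>T.txt, with "d" standing in for the decimal point
FILENAME_PATTERN = re.compile(r"(\d+)d(\d+)mK(\d+)d(\d+)T\.txt")


def parse_conditions(filename):
    """Return (temperature in mK, magnetic field in T) from a filename."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    t_int, t_frac, b_int, b_frac = match.groups()
    return float(f"{t_int}.{t_frac}"), float(f"{b_int}.{b_frac}")


temperature, field = parse_conditions("100d5mK2d0T.txt")
print(f"Temperature: {temperature} mK, magnetic field: {field} T")
```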

constructing data-type instances from CSV

I have CSV data (inherited - no choice here) which I need to use to create data type instances in Haskell. Parsing CSV is no problem - tutorials and APIs abound.
Here's what 'show' generates for my simplified trimmed-down test-case:
JField {fname = "cardNo", ftype = "str"} (string representation)
I am able to do a read to convert this string into a JField data record. My CSV data is just the values of the fields, so the CSV row corresponding to JField above is:
cardNo, str
and I am reading these in as List of string ["cardNo", "str"]
So - it's easy enough to brute-force the exact format of "string representation" (but writing Java or python-style string-formatting in Haskell isn't my goal here).
I thought of doing something like this (the first list is static, and the second list would be read from the CSV file):
let stp1 = zip ["fname = ", "ftype ="] ["cardNo", "str"]
resulting in
[("fname = ","cardNo"),("ftype =","str")]
and then concatenating the tuples - either explicitly with ++ or in some more clever way yet to be determined.
This is my first simple piece of code outside of tutorials, so I'd like to know if this seems a reasonably Haskellian way of doing this, or what clearly better ways there are to build just this piece:
fname = "cardNo", ftype = "str"
Not expecting solutions (this is not homework, it's a learning exercise), but rather critique or guidelines for better ways to do this. Brute-forcing it would be easy but would defeat my objective, which is to learn.
I might be way off, but wouldn't a map be better here? I guess I'm assuming that you read the file in with each row as a [String] i.e.
field11, field12
field21, field22
etc.
You could write
map (\[x,y] -> JField {fname = x, ftype = y}) rows
where rows is your input (data itself is a reserved word in Haskell, so it can't be used as a variable name). I think that would do it.
If you already have the value of the fname field (say, in the variable fn) and the value of the ftype field (in ft), just do JField {fname=fn, ftype=ft}. For non-String fields, just insert a read where appropriate.

Resources