Exporting Tabstat in Stata - statistics

I am trying to export tabstat results from Stata. I am using the following commands.
estpost tabstat x1 x2 x3 x4, by(country)
estout using Data\summary.csv
However, when I open the CSV file, I only find the following:
country
b
Please let me know if there is something wrong with the commands I am using.

Answer coming from:
http://repec.org/bocode/e/estout/estpost.html#estpost101b
by country: eststo: quietly estpost summarize x1 x2 x3 x4, listwise
esttab using summary.csv, cells("mean") label nodepvar
You can add different summary stats to cells, for example: cells("mean sd min max") would show the mean, standard deviation, minimum, and maximum for each x in each country.
Hope this helps

Try
eststo X : qui estpost tabstat x1 x2 x3 x4 , by(country) stats(mean)
esttab X using summary.csv , cells("x1 x2 x3 x4") plain nomtitle nonumber noobs
The plain option is supposed to convert the annoying ="0.143" to 0.143 in your output according to the documentation, but it is not working for me.
You can use the xls output format instead, but then it prints all values in the same cell, with five spaces between most (but not the first two) numbers on each line.
As in so many cases, you are better off saving your dataset, reading it into Python with pd.read_stata(), and solving the problem there. Nowadays you can even call Python from within Stata, although that path leads to the dark side...
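A rough sketch of that pandas route, assuming the dataset was saved as a .dta file containing the country column and the numeric columns x1 to x4 from the question. The function names and file names are hypothetical and only serve the illustration.

```python
import pandas as pd


def country_means(df, value_cols, by="country"):
    """Return one row per group holding the mean of each value column."""
    return df.groupby(by)[value_cols].mean()


def export_country_means(dta_path, csv_path, value_cols, by="country"):
    """Read a Stata file, compute group means, and write them to CSV."""
    df = pd.read_stata(dta_path)          # load the saved Stata dataset
    table = country_means(df, value_cols, by=by)
    table.to_csv(csv_path)                # plain numbers, no ="..." wrappers
    return table


# Hypothetical usage (paths are placeholders):
# export_country_means("mydata.dta", "summary.csv", ["x1", "x2", "x3", "x4"])
```

Other statistics can be requested by swapping .mean() for .agg(["mean", "std", "min", "max"]).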


Numpy Median Value Calculated not represented on BarPlot, How can I represent values according

Hi and thank you for visiting my post.
Here is working code that produces the median values
Wall_Median = pd.pivot_table(cleaned_pokedex, values="Wall", index ='Primary Type',aggfunc={"Wall": np.median})
Final_Wall_Median = Wall_Median.nlargest(18,'Wall')
print(Final_Wall_Median)
For example, the median for Poison is 193, but the bar chart shows over 200.
Primary Type   Wall
Steel          259.0
Fairy          244.0
Dragon         237.0
Rock           235.5
Ground         235.0
Ice            230.0
Flying         220.0
Fighting       216.0
Ghost          215.0
Psychic        215.0
Grass          209.5
Water          208.0
Fire           204.0
Electric       201.0
Dark           200.0
Normal         194.0
Poison         193.0
Bug            180.0
Plotting the values using a seaborn bar chart does not produce the numeric value I receive from the code
fig = plt.gcf()
fig.set_size_inches(20,18)
ax = sns.barplot(x= cleaned_pokedex["Wall"],y= cleaned_pokedex["Primary Type"],data= Final_Wall_Median,palette = pkmn_type_colors)
Output
The bar values don't match the medians printed above. What can I do to fix this?
It seems that you are actually plotting the mean with a CI band instead of the median you intended. That is because there is a small contradiction in your code:
ax = sns.barplot(x= cleaned_pokedex["Wall"],y= cleaned_pokedex["Primary Type"],data= Final_Wall_Median,palette = pkmn_type_colors)
you are telling seaborn to get the x and y values from cleaned_pokedex dataframe,
however, then you tell it to use data from the Final_Wall_Median dataframe.
So it seems that seaborn is choosing to use the x and y arrays you provided, instead of the pre-aggregated Final_Wall_Median that you pass into data. Typically, you would use only the x and y arguments if you just want to pass any two arrays (they don't need to be from the same dataframe), OR you can provide data as the dataframe you want to use, and x and y as string column names (e.g. (x="Wall", y="Primary Type", data=cleaned_pokedex)).
However, as pointed out, if you simply pass the "Wall", "Primary Type" dimensions into the x and y values of a barplot, seaborn will by default use the "mean" as the estimator.
The two options you have are:
sns.barplot(x=cleaned_pokedex["Wall"], y=cleaned_pokedex["Primary Type"], estimator=np.median)
# or
sns.barplot(x=Final_Wall_Median.Wall, y=Final_Wall_Median.index)
Since you've already pre-aggregated the medians, you can use Final_Wall_Median directly. The only difference is that you cannot get CI bands if you don't supply the raw data (the whole cleaned_pokedex dataframe, as in the first option).
barplot() takes a parameter estimator= that defines how the bar height is calculated. By default, this is done using mean(), but you can pass median if that's what you want:
ax = sns.barplot(..., estimator=np.median)
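To see why the bars can sit above the printed medians, here is a small standard-library sketch that applies both estimators to the same group. The three values are made up for illustration (chosen so that the median is 193); they are not taken from the Pokédex data, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(rows):
    """Group (label, value) pairs and compute the mean and median per label."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return {
        label: {"mean": mean(values), "median": median(values)}
        for label, values in groups.items()
    }


# Invented values: one large value pulls the mean above the median.
rows = [("Poison", 150), ("Poison", 193), ("Poison", 290)]
summary = summarize_by_group(rows)
print(summary["Poison"])
```

With skewed data like this, a bar drawn from the mean ends up higher than the median, which matches the symptom described in the question.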

How to display Date (X Axis) info with mpldatacursor?

I'm working on a telemetry system, and right now I would like to click on each scatter point in my plot and see its pair of coordinates.
My plot is a time series, so I'm having a hard time displaying each date with datacursor. I'm currently using this line
plt.gca().fmt_xdata = matplotlib.dates.DateFormatter('%H:%M:%S')
which confirms that my X axis is date-based.
I have already tried like this:
datacursor(ax1, formatter = 'Valor medido : {y:.6f} às {x:.6f}'.format)
The output is OK for Y, but the date comes out as an "epoch number", like "57990.011454".
After a little research, I can convert this number with:
matplotlib.dates.num2date(d).strftime('%H:%M:%S')
but I'm failing to put it all together to display in my cursor.
Thanks in advance!
formatter= accepts any function that returns a string. You could therefore write the following (code untested because you did not provide a Minimal, Complete, and Verifiable example):
import matplotlib.dates
from mpldatacursor import datacursor

def print_coords(**kwargs):
    # kwargs['x'] is the matplotlib date number; convert it before formatting
    return 'Valor medido : {y:.6f} às {x:s}'.format(
        y=kwargs['y'],
        x=matplotlib.dates.num2date(kwargs['x']).strftime('%H:%M:%S'))

datacursor(ax1, formatter=print_coords)
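The conversion step can also be sketched without matplotlib, under the assumption that x is a day-count float measured from a midnight epoch, so that its fractional part encodes the time of day. The helper name time_of_day is hypothetical, and the sketch ignores time zones.

```python
def time_of_day(date_num):
    """Format the fractional-day part of a day-count float as HH:MM:SS."""
    # Assumption: the integer part counts whole days from a midnight epoch,
    # so only the fractional part matters for the clock time.
    seconds = int((date_num % 1.0) * 86400)  # truncate to whole seconds
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def print_coords(**kwargs):
    """Formatter-style callback: receives x and y as keyword arguments."""
    return "Valor medido : {y:.6f} às {x}".format(
        y=kwargs["y"], x=time_of_day(kwargs["x"])
    )


print(print_coords(x=57990.011454, y=1.5))
```

Because the callback only needs to return a string, any conversion of this kind can be dropped into the formatter function.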

Plot the distance between every two points in 2 D

I have a table with three columns: the first column holds the name of each point, the second holds numerical data (the mean), and the third holds the second column plus a fixed number. The following is an example of what the data looks like:
I want to plot this table to get the following figure:
If it is possible, how can I plot it using Microsoft Excel, Python, or R (Bokeh)?
I only know how to do this in ggplot2, so I will answer for R here.
These methods only work if the data frame is in the format you showed above.
I renamed your columns to Name.of.Method, Mean, Mean.2.2.
Preparation
Load the CSV data into R:
df <- read.csv('yourdata.csv', sep = ',')
Change the column names (do this if you want to use the code below unchanged; otherwise you will need to adjust each parameter to match your own column names):
names(df) <- c("Name.of.Method", "Mean", "Mean.2.2")
Method 1 - Using geom_segment()
library(ggplot2)
ggplot() +
  geom_segment(data = df,
               aes(x = Mean, y = Name.of.Method,
                   xend = Mean.2.2, yend = Name.of.Method))
As you can see, geom_segment() lets us specify the end position of each line (hence xend and yend).
However, the result does not look like the image you posted.
The line shape there looks like an error bar, and ggplot2 provides an error bar function for that.
Method 2 - Using geom_errorbarh()
ggplot(df, aes(y = Name.of.Method, x = Mean)) +
  geom_errorbarh(aes(xmin = Mean, xmax = Mean.2.2), linetype = 1, height = .2)
Usually we don't use this method just to draw a line, but its functionality fits your requirement. Here xmin and xmax specify the two ends of the line.
The height argument sets the height of the caps at both ends of the line.
I would use hbar for this:
from bokeh.io import show, output_file
from bokeh.plotting import figure
output_file("intervals.html")
names = ["SMB", "DB", "SB", "TB"]
p = figure(y_range=names, height=350)  # use plot_height=350 on Bokeh versions before 3.0
p.hbar(y=names, left=[4,3,2,1], right=[6.2, 5.2, 4.2, 3.2], height=0.3)
show(p)
However, Whisker would also be an option if you really want whiskers instead of interval bars.

plotting a 3D+colour scatter with gnuplot (on torch7)

I'm working with torch7, and I created a PCA function, which gives me an Nx3 tensor which I wish to plot (3D scatter).
I stored it in a file (file.dat).
Now I want to plot it, so I wrote the following lines.
NOTE: those lines are in torch7 (Lua), but you don't really need to know the language, because the command gnuplot.raw("<command>") uses the regular gnuplot commands.
NOTE 2: I followed answers on this forum to create this part, so I have probably read a relevant thread you might want to link here. If you do, please explain the difference between the linked explanation and what I did.
gnuplot.raw("rgb(r,g,b) = 65536*r + 256*g + b")
gnuplot.raw("blue = rgb(0,0,200)")
gnuplot.raw("red = rgb(200,0,0)")
gnuplot.raw("layer = 1")
gnuplot.raw("splot './file.dat' using 1:2:3:(($4-layer)<0.1 ? red : blue) with points pt 7 linecolor rgb variable notitle")
Columns 1 through 3 in file.dat are the x, y, z coordinates; column 4 is either 1 or 2 (it determines the colour).
LAST NOTE: my script doesn't print an error of any kind, it just doesn't plot the desired 3D scatter.
Thanks ahead
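As a sanity check on the colour logic in the commands above, here is the same arithmetic written out in Python. It only mirrors the rgb() packing and the ternary expression from the question; it does not diagnose why nothing is plotted, and the function name point_colour is hypothetical.

```python
def rgb(r, g, b):
    """Pack 8-bit channels into one integer, as in the gnuplot rgb() helper."""
    return 65536 * r + 256 * g + b


RED = rgb(200, 0, 0)
BLUE = rgb(0, 0, 200)


def point_colour(col4, layer=1):
    """Mirror of the gnuplot expression (($4-layer)<0.1 ? red : blue)."""
    return RED if (col4 - layer) < 0.1 else BLUE


# Column 4 is either 1 or 2 according to the question.
for value in (1, 2):
    print(value, f"0x{point_colour(value):06x}")
```

Under these definitions, rows with column 4 equal to 1 get the red value and rows with 2 get the blue value.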

Correlation coefficient on gnuplot

I want to fit my data with the function f(x) = a+b*x**2. After plotting, I get this result:
correlation matrix of the fit parameters:
m n
m 1.000
n -0.935 1.000
My question is: how can I find the correlation coefficient in gnuplot?
You can use the stats command in gnuplot, which has syntax similar to the plot command:
stats "file.dat" using 2:(f($2)) name "A"
The correlation coefficient will be stored in the A_correlation variable. (With no name specification, it would be STATS_correlation.) You can use it subsequently to plot your data or just print on the screen using the set label command:
set label 1 sprintf("r = %4.2f",A_correlation) at graph 0.1, graph 0.85
You can find more about the stats command in gnuplot documentation.
Although there is no direct solution to this problem, a workaround is possible. I'll illustrate it using python/numpy. First, the part of the gnuplot script that generates the fit and connects with a python script:
file = "my_data.tsv"
f(x)=a+b*(x)
fit f(x) file using 2:3 via a,b
r = system(sprintf("python correlation.py %s",file))
ti = sprintf("y = %.2f + %.2fx (r = %s)", a, b, r)
plot \
file using 2:3 notitle,\
f(x) title ti
This runs correlation.py to retrieve the correlation 'r' in string format. It uses 'r' to generate a title for the fit line. Then, correlation.py:
from numpy import genfromtxt
from numpy import corrcoef
import sys
data = genfromtxt(sys.argv[1], delimiter='\t')
r = corrcoef(data[1:,1],data[1:,2])[0,1]
print(("%.3f" % r).lstrip('0'))  # strip the leading zero from the formatted string, then print it
Here, the first row is assumed to be a header row. Furthermore, the columns used for the correlation are hardcoded to zero-based indices 1 and 2 (the same data that gnuplot reads with using 2:3). Of course, both settings can be changed and turned into arguments as well.
The resulting title of the fit line is (for a personal example):
y = 2.15 + 1.58x (r = .592)
Since you are probably using the fit command, you can first refer to this link to arrive at R2 values.
The link uses certain existing variables like FIT_WSSR, FIT_NDF to calculate R2 value.
The code for R2 is stated as:
SST = FIT_WSSR/(FIT_NDF+1)
SSE=FIT_WSSR/(FIT_NDF)
SSR=SST-SSE
R2=SSR/SST
The next step is to show the R2 value on the graph, which can be done with:
set label 1 sprintf("R2 = %f",R2) at graph 0.7, graph 0.7
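The snippet above is reproduced as stated in the linked post. As a cross-check outside gnuplot, the usual textbook coefficient of determination is R2 = 1 - SS_res/SS_tot; a minimal Python sketch follows, with a hypothetical function name and made-up numbers.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(fitted) or not observed:
        raise ValueError("observed and fitted must be non-empty and equal in length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: data versus fitted values.
    ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    # Total sum of squares: data versus its own mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up data and fitted values, for illustration only.
observed = [1.0, 2.0, 4.0]
fitted = [1.0, 2.5, 3.5]
print(round(r_squared(observed, fitted), 4))
```

The fitted values would come from evaluating the fitted f(x) at each data point.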
If you're looking for a way to calculate the correlation coefficient as defined on this page, you are out of luck using gnuplot as explained in this Google Groups thread.
There are lots of other tools for calculating correlation coefficients, e.g. numpy.
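If numpy is not at hand, the Pearson correlation coefficient is also short to write directly. This is a minimal sketch with a hypothetical function name and made-up data; numpy's corrcoef computes the same quantity.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred cross products and squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Made-up data, for illustration only.
print(pearson_r([1, 2, 3], [1, 3, 2]))
```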
