Need to plot histogram in Pandas such that x axis is categorical and y axis is sum of some column - python-3.x

I have a data frame in Pandas (using Python 3.7) as shown below:
# actuals probability bucket
# 0 0.0 0.116375 2
# 1 0.0 0.239069 3
# 2 1.0 0.591988 6
# 3 0.0 0.273709 3
# 4 1.0 0.929855 10
Where 'bucket' can take discrete values from 1 to 10, and 'actuals' can take only two values, either 1 or 0.
I need to plot a histogram such that x-axis = 'bucket' (i.e. 1 to 10) and y-axis = sum of 'actuals'. How can I do that?

Use groupby.sum with plot:
df.groupby('bucket')['actuals'].sum().plot(kind='bar')
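If you need a histogram, use kind='hist'. Note that kind='hist' bins the summed values themselves, so it shows the distribution of the per-bucket sums rather than one bar per bucket.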
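A minimal sketch of the groupby approach on the five sample rows from the question. The reindex step is an assumption added here so that buckets with no rows still appear on the x-axis as 0, and plot_bucket_sums is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame({
    "actuals": [0.0, 0.0, 1.0, 0.0, 1.0],
    "probability": [0.116375, 0.239069, 0.591988, 0.273709, 0.929855],
    "bucket": [2, 3, 6, 3, 10],
})

# Sum 'actuals' per bucket. The reindex is an assumption: it fills
# buckets 1..10 that have no rows with 0 so every bucket gets a bar.
sums = (
    df.groupby("bucket")["actuals"]
    .sum()
    .reindex(range(1, 11), fill_value=0.0)
)

def plot_bucket_sums(series):
    """Hypothetical helper: bar chart with one bar per bucket (needs matplotlib)."""
    ax = series.plot(kind="bar")
    ax.set_xlabel("bucket")
    ax.set_ylabel("sum of actuals")
    return ax
```

Calling plot_bucket_sums(sums) draws the chart; without the reindex step only buckets 2, 3, 6 and 10 would be drawn.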

Contour plots of noisy data - gridding and averaging

I am trying to make a contour plot from a dataframe in which the x and y coordinates are unevenly spaced and sometimes overlap and the z coordinate is noisy:
x y z
1 15.4707 174.6779 1592.811638
2 15.4707 171.3179 1304.953183
3 61.6107 108.2379 1687.233377
4 46.3707 151.6929 1688.368690
5 30.7107 124.5429 1339.451757
6 31.1307 202.8704 1616.756963
7 0.2307 141.5029 1620.288736
8 15.4707 141.9054 1167.798302
9 46.3707 72.0729 1687.546227
10 15.4707 212.6929 638.059709
What I'd like to do is to define a grid in x and y whose gridlines pass through the coordinates, say
x=[7.5, 22.5, 37.5, 52.5]
y=[60, 120, 180, 240]
In every grid section, I then take the average of the z values and make a new dataframe where the x and y columns are the centres of the grid sections and the z column is the aforementioned average. The dataframe should look something like
x y z
1 15 90 1621.1
2 30 150 1444.2
3 45 210 1651.7
From this stage it is easy to get a contour plot using matplotlib.contourf or similar, but how can I do this type of gridding and averaging? Is there an elegant way to do it in Pandas or other Python packages?
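One possible way to do the gridding and averaging described in this question, sketched with pandas.cut and groupby on the ten sample rows. The function name grid_average is hypothetical, and dropping points that fall outside the outermost gridlines is an assumption, since the question does not say how they should be handled.

```python
import numpy as np
import pandas as pd

def grid_average(df, x_edges, y_edges):
    """Average z inside each grid cell; return cell centres and mean z.

    Assumption: points outside the outermost gridlines are discarded.
    """
    x_edges = np.asarray(x_edges, dtype=float)
    y_edges = np.asarray(y_edges, dtype=float)
    x_centres = (x_edges[:-1] + x_edges[1:]) / 2
    y_centres = (y_edges[:-1] + y_edges[1:]) / 2

    # labels=False gives the integer bin index, NaN for out-of-range points.
    ix = pd.cut(df["x"], bins=x_edges, labels=False)
    iy = pd.cut(df["y"], bins=y_edges, labels=False)
    inside = ix.notna() & iy.notna()

    cells = pd.DataFrame({
        "x": x_centres[ix[inside].astype(int).to_numpy()],
        "y": y_centres[iy[inside].astype(int).to_numpy()],
        "z": df.loc[inside, "z"].to_numpy(),
    })
    return cells.groupby(["x", "y"], as_index=False)["z"].mean()

# Sample rows from the question.
df = pd.DataFrame({
    "x": [15.4707, 15.4707, 61.6107, 46.3707, 30.7107,
          31.1307, 0.2307, 15.4707, 46.3707, 15.4707],
    "y": [174.6779, 171.3179, 108.2379, 151.6929, 124.5429,
          202.8704, 141.5029, 141.9054, 72.0729, 212.6929],
    "z": [1592.811638, 1304.953183, 1687.233377, 1688.368690, 1339.451757,
          1616.756963, 1620.288736, 1167.798302, 1687.546227, 638.059709],
})

gridded = grid_average(df, [7.5, 22.5, 37.5, 52.5], [60, 120, 180, 240])
```

Cells that contain no points are simply absent from the result, so a contour routine that needs a full rectangular grid would still require a pivot and some policy for the empty cells.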

How do you sum every nth column of data in gnuplot?

I would like to take an average over several columns of a data set in Gnuplot. The problem is that I want to average every other column (starting from the second column of my dataset). I was thinking of using every somehow but I still don't really understand when and where to use every. To help visualise my question: my data looks something like this:
x y1 z1 y2 z2
2 0.6 0 0.6 0
1 0.7 0 0.7 1
1 0.8 2 0.8 1
1 0.9 0 0.9 0
and I would like to average y1 and y2 and plot the result by doing something like:
stats filename nooutput
plot filename u 1:sum[col = every :2::2::STATS_columns] / ((STATS_columns-1)/2)
Not sure if this is anywhere close to doable though. Also, it would be nice to have a way of finding the number of columns used without any a priori knowledge of what the data looks like. In the example I have used my knowledge of the data to know that the average is taken over (STATS_columns-1)/2 columns.
Thank you for your response
From your code I assume you want to average y1 and y2 for each row and then plot it versus x (column 1). Since you have several identical x values, there would be another average, namely an average over the columns and over all identical x values.
I modified your data to better illustrate the difference.
I guess you were asking for the red circles. The blue triangles are basically the average of the average, i.e. the average of the red points for each x.
Check help summation and help smooth. The index of sum cannot take a step size, so the column number has to be computed from the index instead.
From gnuplot help:
sum [<var> = <start> : <end>] <expression>
Code:
### average over columns and smooth
reset session
$Data <<EOD
#x y1 z1 y2 z2
1 2.0 0 4.0 0
1 2.2 0 4.2 1
1 2.9 2 4.9 1
2 2.1 0 4.1 0
2 2.3 0 4.3 0
2 2.8 0 4.8 0
3 2.2 0 4.2 0
3 2.3 0 4.3 0
3 2.7 0 4.7 0
EOD
stats $Data nooutput
set offsets 0.5,0.5,0.5,0.5
Count = (STATS_columns-1)/2
plot $Data u 1:((sum[i=1:Count] column(i*2))/Count) w p pt 7 lc rgb "red" ti "average over y1,y2 columns for each row",\
$Data u 1:((sum[i=1:Count] column(i*2))/Count) smooth unique w p pt 9 lc rgb "blue" ti "average over y1,y2 for each x"
### end of code
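For readers who want to check the numbers outside gnuplot, here is a small pandas sketch of the same two quantities on the modified data above. It is not part of the original answer: row_avg corresponds to the red points (average of y1 and y2 per row) and per_x_avg to the blue points (what smooth unique produces, the mean of those row averages for each x). The variable names are assumptions.

```python
import pandas as pd

# Same values as the $Data block in the gnuplot script.
data = pd.DataFrame(
    [
        [1, 2.0, 0, 4.0, 0],
        [1, 2.2, 0, 4.2, 1],
        [1, 2.9, 2, 4.9, 1],
        [2, 2.1, 0, 4.1, 0],
        [2, 2.3, 0, 4.3, 0],
        [2, 2.8, 0, 4.8, 0],
        [3, 2.2, 0, 4.2, 0],
        [3, 2.3, 0, 4.3, 0],
        [3, 2.7, 0, 4.7, 0],
    ],
    columns=["x", "y1", "z1", "y2", "z2"],
)

# Every other column starting from the second one: y1, y2, ...
y_cols = data.iloc[:, 1::2]

# Average over the y columns for each row (the red points).
row_avg = y_cols.mean(axis=1)

# Average of those row averages for each distinct x (the blue points).
per_x_avg = row_avg.groupby(data["x"]).mean()
```

The slice 1::2 plays the role of column(i*2) in the gnuplot sum, so no prior knowledge of the number of y columns is needed here either.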

heatmap color not relating with data in gnuplot

I am trying to create a heatmap using Gnuplot, and my data file is structured as shown below:
6 5 4 3 1 0
3 2 2 0 0 1
0 0 0 0 1 0
0 0 0 0 2 3
0 0 1 2 4 3
The cell values are z values; columns represent the y-axis and rows the x-axis. That means the first value, 6, is the z value at the 5th y position for x label zero. However, when plotting the heatmap I get colors that do not correlate with the z values. Also, I get five bins on the x-axis (there should be 6) and 4 bins on the y-axis (there should be 5). My simple code is written below:
set pm3d map
splot 'm.txt' matrix
Please help me out of this confusing situation.
Thanks.

How to build a scatter graph in excel with average y value for each x value

I am not sure this is the best place to ask,
but I have summarized my program performance data in an Excel file and I want to build a scatter graph.
For each x value I have 6 y values, and I want my graph to contain the average of those 6 for each x.
Is there a way to do this in Excel?
For example: I have
X Y
1 0.2
1 0
1 0
1 0.8
1 1.4
1 0
2 0.2
2 1.2
2 1
2 2.2
2 0
2 2.2
3 0.8
3 1.6
3 0
3 3.6
3 1.2
3 0.6
For each x I want my graph to contain the average y.
Thanks
Not certain what you want but suggest inserting a column (assumed to be B) immediately between your two existing ones and populating it with:
=AVERAGEIF(A:A,A2,C:C)
then plotting X against those values.
Or maybe better, just subtotal for each change in X with average for Y and plot that.
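As a cross-check of what the AVERAGEIF column should contain, the per-x averages for the sample data can be computed with a few lines of Python. This is only an illustrative sketch outside Excel, and average_by_x is a hypothetical helper name.

```python
from collections import defaultdict

# (X, Y) pairs from the example in the question.
rows = [
    (1, 0.2), (1, 0), (1, 0), (1, 0.8), (1, 1.4), (1, 0),
    (2, 0.2), (2, 1.2), (2, 1), (2, 2.2), (2, 0), (2, 2.2),
    (3, 0.8), (3, 1.6), (3, 0), (3, 3.6), (3, 1.2), (3, 0.6),
]

def average_by_x(pairs):
    """Return {x: mean of all y values that share this x}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for x, y in pairs:
        totals[x] += y
        counts[x] += 1
    return {x: totals[x] / counts[x] for x in totals}

averages = average_by_x(rows)
```

Each row with a given X should show the same value in the helper column, namely the entry of averages for that X.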

Heatmap with Gnuplot on a non-uniform grid

I would like to create a heatmap with gnuplot based on a non-uniform grid, meaning that my x-axis bins do not all have the same width. I can't figure out how to do that, because when I plot my data with, for example, "with image", I get uniformly sized boxes which do not correspond to my coordinates at all (because "image" treats the data just as a matrix, I guess). So I would like to find a method to get non-uniform boxes which are also positioned in the right place on the Cartesian plane.
My data look something like this:
1 1 0.2
1 2 0.8
1 3 0.1
1 4 0.2
2 1 0.7
2 2 0.2
2 3 0.3
2 4 0.1
5 1 0.2
5 2 0.4
5 3 0.1
5 4 0.9
7 1 0.3
7 2 0.2
7 3 0.9
7 4 0.6
If I run this command on Gnuplot
set xrange [1:10]
p 'mydata.dat' with image
I get an image with 16 boxes that have the same width and height (apparently I don't have enough "reputation" on Stackoverflow to post an image, otherwise I would), but ideally I would like the boxes to have different widths and be in the right place on the plane. For example the first box should range from 1 to 2, the second one from 2 to 5, the third one from 5 to 7, and the last one from 7 to 10 (which is why I wrote set xrange [1:10]).
Could anyone help me please? Thank you very much!
The easiest (maybe only viable) way is to add some dummy data points and use splot ... with pm3d. This plotting style handles heatmaps with general quadrangles.
The image plotting style plots one box (one big pixel) for each data point, while pm3d takes each data point as a corner of one or more quadrangles. The color of each quadrangle is determined by the values of its corners and is adjustable with set pm3d corners2color.
So, in your case you need to expand the 4x4 matrix to a 5x5 matrix (expand to the right and to the top), and select the lower left corner to determine the color with set pm3d corners2color c1.
The changed data file is then (in the file itself, consecutive x blocks must be separated by a blank line so that pm3d sees a grid):
1 1 0.2
1 2 0.8
1 3 0.1
1 4 0.2
1 5 0.5
2 1 0.7
2 2 0.2
2 3 0.3
2 4 0.1
2 5 0.5
5 1 0.2
5 2 0.4
5 3 0.1
5 4 0.9
5 5 0.5
7 1 0.3
7 2 0.2
7 3 0.9
7 4 0.6
7 5 0.5
10 1 0.5
10 2 0.5
10 3 0.5
10 4 0.5
10 5 0.5
To plot it use
set pm3d map corners2color c1
set autoscale fix
set ytics 1
splot 'mydata.dat' using 1:($2-0.5):3 notitle
The result with 4.6.3 is:
In general, the z-value of the dummy data points doesn't matter, but in the above script it should lie somewhere between the minimum and maximum values to allow set autoscale fix to work properly on the color scale.
If you don't want to change the data file manually, you could do it with some script, but that's a different question.
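A minimal Python sketch of such a script, assuming the layout used above: one dummy point is appended on top of every x block and one full dummy block is appended on the right. The function names, the closing coordinates x_end=10 and y_end=5, and the dummy z value 0.5 are taken from or modelled on the example and are assumptions, not a general recipe.

```python
RAW = """\
1 1 0.2
1 2 0.8
1 3 0.1
1 4 0.2
2 1 0.7
2 2 0.2
2 3 0.3
2 4 0.1
5 1 0.2
5 2 0.4
5 3 0.1
5 4 0.9
7 1 0.3
7 2 0.2
7 3 0.9
7 4 0.6
"""

def parse(text):
    """Read whitespace-separated 'x y z' lines, skipping blank lines."""
    return [tuple(float(t) for t in line.split())
            for line in text.splitlines() if line.strip()]

def pad_grid(points, x_end, y_end, dummy_z):
    """Return one block per x, each extended by a dummy top point,
    plus a final dummy block at x_end (assumes a complete x/y grid)."""
    xs = sorted({x for x, _, _ in points})
    ys = sorted({y for _, y, _ in points})
    z = {(x, y): v for x, y, v in points}
    blocks = [[(x, y, z[(x, y)]) for y in ys] + [(x, y_end, dummy_z)]
              for x in xs]
    blocks.append([(x_end, y, dummy_z) for y in ys + [y_end]])
    return blocks

def to_gnuplot_text(blocks):
    """Join blocks with blank lines, as gnuplot expects for grid data."""
    return "\n\n".join(
        "\n".join(f"{x:g} {y:g} {v:g}" for x, y, v in block)
        for block in blocks
    ) + "\n"

blocks = pad_grid(parse(RAW), x_end=10, y_end=5, dummy_z=0.5)
padded_text = to_gnuplot_text(blocks)
```

Writing padded_text to a file gives the 5x5 layout listed above, with the blank lines between x blocks that splot needs to recognise the grid.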
Here is an alternative solution without splot ... pm3d, but with boxxyerror.
When you plot data, it should happen as automatically as possible, and there should be no need to "invent" and manually add data.
The following solution (a little bit more complex) takes care of the widths (+/-dx) and heights (+/-dy) of the boxes according to the following principle:
if it is an "inner" box, take half the distance to the adjacent datapoint on that side
if it is an "outer" box, take half the distance to the adjacent "inner" datapoint
Here, x-distances are irregular and y-distances are regular, but y-distances could also be irregular.
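The half-distance rule above can be illustrated with a short Python sketch. It mirrors what the dn and dp functions in the gnuplot script below compute, but it is only an illustration; box_edges is a hypothetical name and the sketch assumes at least two distinct coordinates.

```python
def box_edges(coords):
    """Return (low, high) box limits for each coordinate in a sorted list.

    Inner sides extend half the distance to the neighbour on that side;
    outer sides reuse half the distance to the single inner neighbour.
    Assumes len(coords) >= 2.
    """
    n = len(coords)
    edges = []
    for i, c in enumerate(coords):
        # Neighbour used for the lower side (first point mirrors its upper neighbour).
        below = coords[i - 1] if i > 0 else coords[1]
        # Neighbour used for the upper side (last point mirrors its lower neighbour).
        above = coords[i + 1] if i < n - 1 else coords[n - 2]
        low = c - abs(c - below) / 2
        high = c + abs(c - above) / 2
        edges.append((low, high))
    return edges

# x-coordinates of the example data (irregular spacing).
x_boxes = box_edges([1, 2, 5, 7])
# y-coordinates of the example data (regular spacing).
y_boxes = box_edges([1, 2, 3, 4])
```

With this rule adjacent boxes share their edges exactly, so the heatmap has no gaps or overlaps even when the spacing is irregular.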
Data: SO19294342.dat
1 1 0.2
1 2 0.8
1 3 0.1
1 4 0.2
2 1 0.7
2 2 0.2
2 3 0.3
2 4 0.1
5 1 0.2
5 2 0.4
5 3 0.1
5 4 0.9
7 1 0.3
7 2 0.2
7 3 0.9
7 4 0.6
Script: (works with gnuplot>=4.6.0, March 2012)
### heatmap with boxxyerror and variable box-sizes
reset
FILE = "SO/SO19294342.dat"
set style fill solid 1.0
set tics out
set size ratio -1
# extract x-positions
Xs = Ys = ''
Nx = Ny = 0
b = -1
stats FILE u (column(-1)!=b ? (Nx=Nx+1, Xs=Xs.sprintf(" %g",$1), b=column(-1)) : 0, \
column(-1)==0 ? (Ny=Ny+1, Ys=Ys.sprintf(" %g",$2)) : 0) nooutput
d(vs,n0,n1) = abs(real(word(vs,n0))-real(word(vs,n1)))/2
dn(vs,n) = (n==1 ? (n0=1,n1=2) : (n0=n,n1=n-1), -d(vs,n0,n1))
dp(vs,n) = (Ns=words(vs), n==Ns ? (n0=Ns-1,n1=Ns) : (n0=n,n1=n+1), d(vs,n0,n1))
plot FILE u 1:2:($1+dn(Xs,column(-1)+1)):($1+dp(Xs,column(-1)+1)):\
($2+dn(Ys,int(column(0))%Ny+1)):($2+dp(Ys,int(column(0))%Ny+1)):3 w boxxy palette notitle
### end of script
For gnuplot>=4.6.5 you could add :xtic(1):ytic(2) to the plot command to show only your x- and y-coordinates as x,y-ticlabels.
plot FILE u 1:2:($1+dn(Xs,column(-1)+1)):($1+dp(Xs,column(-1)+1)):\
($2+dn(Ys,int(column(0))%Ny+1)):($2+dp(Ys,int(column(0))%Ny+1)):3:\
xtic(1):ytic(2) w boxxy palette notitle
And for gnuplot>=5.0.0 you could add noextend to the ranges to avoid white areas on the sides:
set xrange[:] noextend
set yrange[:] noextend
Result: (created with gnuplot 4.6.0)
