Bokeh Dodge Chart Using Different Pandas DataFrames

Hi everyone! I have two dataframes extracted from Pro-Football-Reference as CSV and run through Pandas with the aid of StringIO.
I'm pasting only the header and one row of each below:
data_1999 = StringIO("""Tm,W,L,W-L%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,13,3,.813,423,333,90,5.6,0.5,6.1,6.6,-0.5""")
data = StringIO("""Tm,W,L,T,WL%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,10,6,0,.625,433,344,89,5.6,-2.2,3.4,3.9,-0.6""")
Each is then read normally using pandas.read_csv, creating two dataframes called df_nfl_1999 and df_nfl respectively.
I was trying to use Bokeh to do something like the dodged-bars example here, except that instead of 'apples' and 'pears', the team names would be the main grouping. I tried to emulate it using only the Pandas DataFrame info:
p9=figure(title='Comparison 1999 x 2018',background_fill_color='#efefef',x_range=df_nfl_1999['Tm'])
p9.xaxis.axis_label = 'Team'
p9.yaxis.axis_label = 'Variable'
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.25,range=p9.x_range),top=df_nfl['PF'],legend='PF in 2018', width=0.3, color='#A6CEE3')
show(p9)
And the error I got was:
ValueError: expected an element of either String, Dict(Enum('expr',
'field', 'value', 'transform'), Either(String, Instance(Transform),
Instance(Expression), Float)) or Float, got {'field': 0
Washington Redskins
My initial idea was to group by team name (df_nfl['Tm']) and compare the points in favor in each year (df_nfl['PF'] for 2018 and df_nfl_1999['PF'] for 1999). A simple offset of the columns could resolve it, but I can't seem to find a way to do this other than the dodge chart, and it's not really working (I'm a newbie).
By the way, the traceback points to this line:
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
I could use a scatter plot, for example, and both charts would coexist, and in some cases overlap (if the data is the same), but I was really aiming at plotting it side by side. The other answers related to the subject usually have older versions of Bokeh with deprecated functions.
Any way I can solve this? Thanks!
Edit:
Here is the output of the .head() method. The other dataframe returns exactly the same categories, columns and rows, except that the data obviously changes since it's from a different season.
                    Tm   W   L   W-L%   PF   PA   PD  MoV  SoS  SRS  OSRS  DSRS
0  Washington Redskins  10   6  0.625  443  377   66  4.1 -1.3  2.9   6.8  -3.9
1       Dallas Cowboys   8   8  0.500  352  276   76  4.8 -1.6  3.1  -0.3   3.4
2      New York Giants   7   9  0.438  299  358  -59 -3.7  0.7 -3.0  -1.8  -1.2
3    Arizona Cardinals   6  10  0.375  245  382 -137 -8.6 -0.2 -8.8  -5.5  -3.2
4  Philadelphia Eagles   5  11  0.313  272  357  -85 -5.3  1.1 -4.2  -3.3  -0.9
Executing with x=dodge but without the value argument returns:
dodge() missing 1 required positional argument: 'value'
Adding the argument value=0.0 or value=0.2 returns the same error as in the original post.

The first argument to dodge should be the name of a single column in a ColumnDataSource. The effect is then that any values from that column are dodged by the specified amount when used as coordinates.
You are trying to pass the contents of a column, which is not expected. It's hard to say for sure without complete code to test, but you most likely want
x=dodge('Tm', ...)
However, you will also need to actually use an explicit Bokeh ColumnDataSource and pass it as source to vbar, as is done in the example you link. You can construct one explicitly, but often you can also just pass the dataframe directly as source=df, and it will be adapted.
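Putting those two fixes together, here is a minimal sketch. The combined source, the PF_1999/PF_2018 column names, and the second PF_2018 value are illustrative, not from the original post; also note that newer Bokeh versions use legend_label instead of legend:

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import dodge

# One source holding both seasons, keyed by team name.
# Column names and the Redskins 2018 value are illustrative.
source = ColumnDataSource(data=dict(
    Tm=['Indianapolis Colts', 'Washington Redskins'],
    PF_1999=[423, 443],
    PF_2018=[433, 281],
))

p9 = figure(title='Comparison 1999 x 2018',
            background_fill_color='#efefef',
            x_range=source.data['Tm'])
p9.xaxis.axis_label = 'Team'
p9.yaxis.axis_label = 'PF'

# dodge() takes the *name* of the column, not its contents.
r1 = p9.vbar(x=dodge('Tm', -0.15, range=p9.x_range), top='PF_1999',
             width=0.3, source=source, legend_label='PF in 1999')
r2 = p9.vbar(x=dodge('Tm', 0.15, range=p9.x_range), top='PF_2018',
             width=0.3, source=source, legend_label='PF in 2018',
             color='#A6CEE3')
```

Each team now gets two bars side by side, offset by -0.15 and +0.15 around its categorical position.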

Related

Excel diagram with time value or number on category axis

I need to make a diagram which shows the lines of different ceramic firing schedules. I want them plotted in one diagram, on a time-relative axis, so that the different durations are shown correctly. I don't seem to be able to achieve this.
What I have is the following:
First table:
Pendelen
Temp. per uur
Stooktemp.
Stooktijd 4
Stooktijd Cum.4
95
120
1:15:47
1,26
205
537
2:02:03
3,30
80
620
1:02:15
4,33
150
1075
3:02:00
7,37
50
1196
2:25:12
9,79
10
1196
0:10:00
9,95
Total
9:57:17
Second table (Pendelen):

Temp. per uur   Stooktemp.   Stooktijd 5   Stooktijd Cum.5
140               540        3:51:26        3,86
 65               650        1:41:32        5,55
140              1095        3:10:43        8,73
 50              1222        2:32:24       11,27
Total                       11:16:05
The lines to be shown in a diagram should represent the 'stooktijd cum.' for both programs 4 and 5 (which is a cumulation of the time needed to fire the kiln up from its previous temperature in the schedule). One should be able to see in the diagram that program 5 takes more time to reach its end temperature.
What I achieved is nothing more than a diagram with two lines, but plotted only against the 'stooktijd cum.4' points from program 4. The image shows a screenshot of this diagram.
But as you can see, this doesn't look like program 5 takes more time to reach its end. I would like it to show something like this:
Create this table (one shared time column, with a separate temperature column per program so each series keeps its own x-values):

            p4      p5
 0                  10
 3.86               540
 5.55               650
 8.73              1095
11.27              1222
 0          0
 1.26     120
 3.3      537
 4.33     620
 7.37    1075
 9.79    1196
 9.95    1196
Select all > F11 > Design > Change Chart Type > Scatter with Straight Lines and Markers
Here's my tryout:
Please share whether it works or not. (:
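If you'd rather script the check than build the chart in Excel, here is a quick matplotlib sketch of the same two schedules (values read off the tables above; the output file name is arbitrary):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Cumulative firing time (decimal hours) vs. kiln temperature,
# taken from the two schedule tables above.
p4_time = [0, 1.26, 3.3, 4.33, 7.37, 9.79, 9.95]
p4_temp = [0, 120, 537, 620, 1075, 1196, 1196]
p5_time = [0, 3.86, 5.55, 8.73, 11.27]
p5_temp = [10, 540, 650, 1095, 1222]

fig, ax = plt.subplots()
ax.plot(p4_time, p4_temp, marker='o', label='Program 4')
ax.plot(p5_time, p5_temp, marker='o', label='Program 5')
ax.set_xlabel('Cumulative firing time (hours)')
ax.set_ylabel('Temperature')
ax.legend()
fig.savefig('firing_schedules.png')
```

Because each series carries its own x-values, program 5 visibly ends later (11.27 h) than program 4 (9.95 h).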

Pandas summing rows grouped by another column

I have the attached dataset:
Time podId Batt (avg) Temp (avg)
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920
I imported it using- batt2 = pd.read_csv('battV2.csv')
I need to determine when a battery change occurs, i.e. when Batt (avg) increases from the previous row. I am able to do this by using 'diff' in this manner: batt2['Vdiff']=batt2['Batt (avg)'].diff(-1)
Now for each podId I need to sum the Vdiff column between battery changes, i.e. between two negative Vdiff values
Also I need to average Temp (avg) over the same range
Count Time to determine the number of days between battery changes
Thanks.
There are a couple of steps involved:
Import data
Be aware that I have changed your dataset a bit to provide a valid test case for your requirements (in your given dataset, Batt_avg never increases).
from io import StringIO
import pandas as pd
data = StringIO('''Time podId Batt_avg Temp_avg
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
5 2019-10-12 9999 6.0 71.208285
6 2019-10-13 9999 5.9 77.896628
7 2019-10-14 9999 5.8 78.709279
8 2019-10-15 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920''')
df = pd.read_csv(data, delim_whitespace=True)
Determine changes in battery voltage
As you have already found out, you can do this with diff(). I am not certain that the code you have given with df.Batt_avg.diff(-1) satisfies your requirement of: "i.e. when Batt (avg) increases from previous row". Instead, for a given row, this shows how the value will change in the next row (multiplied by -1). If you need the negative change to the previous row, you can instead use -df.Batt_avg.diff().
df['Batt_avg_diff'] = df.Batt_avg.diff(-1)
Group data and apply the aggregation functions
You can express your grouping conditions as df.podId.diff().fillna(0.0) != 0 for the podIds and df.Batt_avg_diff.fillna(0.0) < 0 for the condition "between battery changes, i.e. between two negative Vdiff values" - either of these will trigger a new group. Use cumsum() on the triggers to create the groups. Then you can use groupby() to act on these groups and transform() to expand the results to the dimensions of the original dataframe.
df['group'] = ((df.podId.diff().fillna(0.0) != 0) | (df.Batt_avg_diff.fillna(0.0) < 0)).cumsum()
df['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).transform('sum')
df['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).transform('mean')
Datetime calculations
For the final step, you need to first convert the string to datetime to allow date operations. Then you can use groupby operations to get the max and min in each group, and take the delta.
df.Time = pd.to_datetime(df.Time)
df['Time_days'] = df.Time.groupby(df.group).transform('max') - df.Time.groupby(df.group).transform('min')
Note: if you do not need or want the aggregate data in the original dataframe, just apply the functions directly (without transform):
df_group = pd.DataFrame()
df_group['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).sum()
df_group['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).mean()
df_group['Time_days'] = df.Time.groupby(df.group).max() - df.Time.groupby(df.group).min()
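As a variant (not required by the steps above), the three per-group reductions can also be done in a single pass with pandas named aggregation; the sketch below rebuilds a tiny dataframe with the same column names used above:

```python
from io import StringIO
import pandas as pd

# Tiny rebuild of the dataframe with the same column names as above.
data = StringIO("""Time podId Batt_avg Temp_avg
2019-10-07 9999 6.1 71.0
2019-10-08 9999 6.0 72.0
2019-10-09 9999 6.2 77.0
2019-10-10 9999 6.1 78.0""")
df = pd.read_csv(data, sep=r'\s+')
df.Time = pd.to_datetime(df.Time)
df['Batt_avg_diff'] = df.Batt_avg.diff(-1)
df['group'] = ((df.podId.diff().fillna(0.0) != 0)
               | (df.Batt_avg_diff.fillna(0.0) < 0)).cumsum()

# All three reductions in one pass over the groups.
df_group = df.groupby('group').agg(
    Batt_avg_diff_sum=('Batt_avg_diff', 'sum'),
    Temp_avg_mean=('Temp_avg', 'mean'),
    Time_min=('Time', 'min'),
    Time_max=('Time', 'max'),
)
df_group['Time_days'] = df_group.Time_max - df_group.Time_min
```

Here the voltage rise between 2019-10-08 and 2019-10-09 starts a new group, so the first group is a single day and the second spans two days.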

Missing date xticks on a matplotlib chart in Python 3. Bug?

I am following this section; I realize this code was made using Python 2, but they have xticks showing on the 'Start Date' axis and I do not. My chart only shows 'Start Date' and no dates are provided. I have attempted to convert the object to datetime, but that shows the dates while breaking the graph below it, and the line goes missing:
Graph
# Set as_index=False to keep the 0,1,2,... index. Then we'll take the mean of the polls on that day.
poll_df = poll_df.groupby(['Start Date'],as_index=False).mean()
# Let's go ahead and see what this looks like
poll_df.head()
Start Date Number of Observations Obama Romney Undecided Difference
0 2009-03-13 1403 44 44 12 0.00
1 2009-04-17 686 50 39 11 0.11
2 2009-05-14 1000 53 35 12 0.18
3 2009-06-12 638 48 40 12 0.08
4 2009-07-15 577 49 40 11 0.09
Great! Now plotting the Difference versus time should be straightforward.
# Plotting the difference in polls between Obama and Romney
fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple')
Notebook is here
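One common fix to try (a sketch, not from the original notebook): convert 'Start Date' to a real datetime dtype before plotting, so pandas hands matplotlib actual dates and the tick labels appear. The tiny frame below stands in for poll_df:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Tiny stand-in for poll_df; the real data comes from the notebook.
poll_df = pd.DataFrame({
    'Start Date': ['2009-03-13', '2009-04-17', '2009-05-14'],
    'Difference': [0.00, 0.11, 0.18],
})
# The key step: make the x column a real datetime, not object/string.
poll_df['Start Date'] = pd.to_datetime(poll_df['Start Date'])

ax = poll_df.plot('Start Date', 'Difference', figsize=(12, 4),
                  marker='o', linestyle='-', color='purple')
ax.figure.savefig('polls.png')
```

With a datetime column, matplotlib's date locator chooses and labels the ticks itself, so the axis no longer shows just the label 'Start Date'.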

How to convert bytes data into a python pandas dataframe?

I would like to convert 'bytes' data into a Pandas dataframe.
The data looks like this (few first lines):
(b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'
The column headers appear at the top; each subsequent line is a timestamp and numbers.
Is there a straightforward way to do this?
Thank you very much
@Paula Livingstone:
This seems to work:
s = str(bytes_data, 'utf-8')
with open('data.txt', 'w') as f:
    f.write(s)
df = pd.read_csv('data.txt')
Maybe this can be done without using a file in between.
I had the same issue and found the StringIO module (https://docs.python.org/2/library/stringio.html) from the answer here: How to create a Pandas DataFrame from a string
Try something like:
from io import StringIO
s=str(bytes_data,'utf-8')
data = StringIO(s)
df=pd.read_csv(data)
You can also use BytesIO directly:
from io import BytesIO
df = pd.read_csv(BytesIO(bytes_data))
This will save you the step of transforming bytes_data to a string
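A self-contained sketch of the BytesIO route, with a tiny stand-in for bytes_data:

```python
from io import BytesIO
import pandas as pd

# Tiny stand-in for the bytes returned by the HTTP GET.
bytes_data = (b'#Settlement Date,Settlement Period,CCGT,OIL\n'
              b'2017-01-01,1,7727,0\n'
              b'2017-01-01,2,8338,0\n')

df = pd.read_csv(BytesIO(bytes_data))
print(df.shape)  # (2, 4)
```

read_csv treats the first line as the header, so the '#' stays part of the first column name unless you strip or rename it.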
Ok cool, your input formatting is quite awkward, but the following works:
with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read in the file as a string
df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)
print(df)
this produces the following:
0 1 2 3 4 5 6 7 \
0 #Settlement Date Settlement Period CCGT OIL COAL NUCLEAR WIND PS
1 2017-01-01 1 7727 0 3815 7404 3923 0
2 2017-01-01 2 8338 0 3815 7403 3658 16
3 2017-01-01 3 7927 0 3801 7408 3925 0
8 9 10 11 12 13 14 15
0 NPSHYD OCGT OTHER INTFR INTIRL INTNED INTEW BIOMASS
1 944 0 2123 948 296 856 238
2 909 0 2124 998 298 874 288
3 864 0 2122 998 298 816 286 None
In order for this to work you will need to ensure that your input file contains only a collection of complete rows. For this reason I removed the partial row for the purposes of the test.
Since you have said that the data source is an HTTP GET request, the initial read would take place using pandas.read_html.
More detail on this can be found here; note specifically the section on the io parameter (io : str or file-like).

Labelling individual data points gnuplot

Just trying to get used to gnuplot. I searched a few pages on this site looking for the answer, read the documentation (4.6), and still haven't found it. Say I have a data file like this:
0.0 0
1.0 25
2.0 55
3.0 110
4.0 456
5.0 554
6.0 345
and I want to label all the data points on the plot. How do I do this? I tried this suggestion: plot 'exp.dat' u 1:2 w labels point offset character 0,character 1 tc rgb "blue", but it didn't work; it gave me a "Not enough columns for this style" error. I'm sure it's something I'm doing, but I'm not sure what. Any help would be appreciated. Thanks.
I think you are missing strings for labels. You can do
flabel(y)=sprintf("y=%.2f", y)
plot '-' u 1:2:(flabel($2)) w labels point offset character 0,character 1 tc rgb "blue"
0.0 0
1.0 25
2.0 55
3.0 110
4.0 456
5.0 554
6.0 345