Pandas summing rows grouped by another column - pandas-groupby

I have attached a dataset:
Time podId Batt (avg) Temp (avg)
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920
I imported it using batt2 = pd.read_csv('battV2.csv').
I need to determine when a battery change occurs, i.e. when Batt (avg) increases from the previous row. I am able to do this with diff, in this manner: batt2['Vdiff'] = batt2['Batt (avg)'].diff(-1)
Now, for each podId, I need to sum the Vdiff column between battery changes, i.e. between two negative Vdiff values.
I also need to average Temp (avg) over the same range,
and count Time to determine the number of days between battery changes.
Thanks.

There are a couple of steps involved:
Import data
Be aware that I have changed your dataset a bit to provide a valid test case for your requirements (in your given dataset, Batt_avg never increases).
from io import StringIO
import pandas as pd
data = StringIO('''Time podId Batt_avg Temp_avg
0 2019-10-07 9999 6.1 71.271053
1 2019-10-08 9999 6.0 71.208285
2 2019-10-09 9999 5.9 77.896628
3 2019-10-10 9999 5.8 78.709279
4 2019-10-11 9999 5.7 71.849283
5 2019-10-12 9999 6.0 71.208285
6 2019-10-13 9999 5.9 77.896628
7 2019-10-14 9999 5.8 78.709279
8 2019-10-15 9999 5.7 71.849283
59 2019-12-05 8888 5.5 76.548780
60 2019-12-06 8888 5.4 73.975295
61 2019-12-07 8888 5.3 76.209434
62 2019-12-08 8888 5.2 76.717481
63 2019-12-09 8888 5.1 70.433920''')
df = pd.read_csv(data, delim_whitespace=True)
Determine changes in battery voltage
As you have already found out, you can do this with diff(). I am not certain that the code you have given, df.Batt_avg.diff(-1), satisfies your requirement of "i.e. when Batt (avg) increases from previous row": for a given row, it shows how the value will change in the next row, multiplied by -1. If you need the (sign-flipped) change relative to the previous row instead, use -df.Batt_avg.diff().
df['Batt_avg_diff'] = df.Batt_avg.diff(-1)
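To make the direction concrete, here is a tiny illustration with values from the sample data (floating-point rounding aside):
import pandas as pd

s = pd.Series([6.1, 6.0, 5.9])
s.diff()    # [NaN, -0.1, -0.1]: change from the previous row
s.diff(-1)  # [0.1,  0.1,  NaN]: change to the next row, sign flipped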
Group data and apply the aggregation functions
You can express your grouping conditions as df.podId.diff().fillna(0.0) != 0 for the podIds, and df.Batt_avg_diff.fillna(0.0) < 0 for the condition "between battery changes, i.e. between two negative Vdiff values"; either one triggers a new group. Use cumsum() on the combined trigger to create the group labels. Then you can use groupby() to act on these groups and transform() to broadcast the results back to the dimensions of the original dataframe.
df['group'] = ((df.podId.diff().fillna(0.0) != 0) | (df.Batt_avg_diff.fillna(0.0) < 0)).cumsum()
df['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).transform('sum')
df['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).transform('mean')
Datetime calculations
For the final step, you need to first convert the string to datetime to allow date operations. Then you can use groupby operations to get the max and min in each group, and take the delta.
df.Time = pd.to_datetime(df.Time)
df['Time_days'] = df.Time.groupby(df.group).transform('max') - df.Time.groupby(df.group).transform('min')
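A small aside, not part of the original answer: the result above is a Timedelta. If you would rather have a plain integer number of days, the .dt.days accessor converts it:
df['Time_days'] = (df.Time.groupby(df.group).transform('max')
                   - df.Time.groupby(df.group).transform('min')).dt.days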
Note: if you do not need or want the aggregate data in the original dataframe, just apply the functions directly (without transform):
df_group = pd.DataFrame()
df_group['Batt_avg_diff_sum'] = df.Batt_avg_diff.groupby(df.group).sum()
df_group['Temp_avg_mean'] = df.Temp_avg.groupby(df.group).mean()
df_group['Time_days'] = df.Time.groupby(df.group).max() - df.Time.groupby(df.group).min()

Related

VBA solution of VIF factors [EXCEL]

I have several multiple linear regressions to carry out, I am wondering if there is a VBA solution for getting the VIF of regression outputs for different equations.
My current data format:
i=1
Year DependantVariable Variable2 Variable3 Variable4 Variable5 ....
2009 100 10 20 -
2010 110 15 25 -
2011 115 20 30 -
2012 125 25 35 -
2013 130 25 40 -
I have the above table, with the value of i determining the values of the variables (essentially, a different regression input table for every value of i).
I am looking for a VBA macro that will check every value of i (stored in a column), calculate the VIFs for that value of i, and output something like the below:
ivalue variable1VIF variable2VIF ...
1 1.1 1.3
2 1.2 10.1
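Not a VBA answer, but the formula the macro needs is VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A minimal Python sketch of that calculation (statsmodels assumed available; the sample values are hypothetical), shown only to illustrate what the VBA would have to reproduce:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictor table for one value of i
X = pd.DataFrame({'Variable2': [10, 15, 20, 25, 25],
                  'Variable3': [20, 25, 30, 35, 40]})

exog = sm.add_constant(X).values
# Index 0 is the constant, so predictor j sits at column j + 1
vifs = {col: variance_inflation_factor(exog, j + 1) for j, col in enumerate(X.columns)}
print(vifs)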

Missing Date xticks on chart for matplotlib on Python 3. Bug?

I am following this section. I realize this code was written for Python 2, but the authors have xticks showing on the 'Start Date' axis and I do not: my chart only shows the 'Start Date' label and no dates. I have attempted to convert the object column to datetime, but that shows the dates while breaking the graph below it, and the line goes missing.
# Set as_index=False to keep the 0,1,2,... index. Then we'll take the mean of the polls on that day.
poll_df = poll_df.groupby(['Start Date'],as_index=False).mean()
# Let's go ahead and see what this looks like
poll_df.head()
Start Date Number of Observations Obama Romney Undecided Difference
0 2009-03-13 1403 44 44 12 0.00
1 2009-04-17 686 50 39 11 0.11
2 2009-05-14 1000 53 35 12 0.18
3 2009-06-12 638 48 40 12 0.08
4 2009-07-15 577 49 40 11 0.09
Great! Now plotting the Difference versus time should be straightforward.
# Plotting the difference in polls between Obama and Romney
fig = poll_df.plot('Start Date','Difference',figsize=(12,4),marker='o',linestyle='-',color='purple')
Notebook is here
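A likely cause is that 'Start Date' is still stored as plain strings, so matplotlib cannot build a date axis from it. A minimal sketch of the usual fix, assuming that is the case: convert the column with pd.to_datetime before grouping and plotting.
import pandas as pd

# Assumption: 'Start Date' was read in as strings; convert it first
poll_df['Start Date'] = pd.to_datetime(poll_df['Start Date'])
poll_df = poll_df.groupby('Start Date', as_index=False).mean()

# DataFrame.plot returns an Axes; date xticks appear once the dtype is datetime64
ax = poll_df.plot('Start Date', 'Difference', figsize=(12, 4),
                  marker='o', linestyle='-', color='purple')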

Bokeh Dodge Chart using Different Pandas DataFrame

Hi everyone! I have 2 dataframes extracted from Pro-Football-Reference as a CSV and read into Pandas with the aid of StringIO.
I'm pasting only the header and a row of the info right below:
data_1999 = StringIO("""Tm,W,L,W-L%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,13,3,.813,423,333,90,5.6,0.5,6.1,6.6,-0.5""")
data = StringIO("""Tm,W,L,T,WL%,PF,PA,PD,MoV,SoS,SRS,OSRS,DSRS
Indianapolis Colts,10,6,0,.625,433,344,89,5.6,-2.2,3.4,3.9,-0.6""")
And then interpreted normally using pandas.read_csv, creating 2 different dataframes called df_nfl_1999 and df_nfl respectively.
So I was trying to use Bokeh to do something like here, except that instead of 'apples' and 'pears' the main grouping would be the team names. I tried to emulate it using only the Pandas DataFrame info:
p9=figure(title='Comparison 1999 x 2018',background_fill_color='#efefef',x_range=df_nfl_1999['Tm'])
p9.xaxis.axis_label = 'Team'
p9.yaxis.axis_label = 'Variable'
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.25,range=p9.x_range),top=df_nfl['PF'],legend='PF in 2018', width=0.3, color='#A6CEE3')
show(p9)
And the error I got was:
ValueError: expected an element of either String, Dict(Enum('expr',
'field', 'value', 'transform'), Either(String, Instance(Transform),
Instance(Expression), Float)) or Float, got {'field': 0
Washington Redskins
My initial idea was to group by team name (df_nfl['Tm']), comparing the points in favor in each year (so df_nfl['PF'] for 2018 and df_nfl_1999['PF'] for 1999). A simple offset of the columns could solve it, but I can't seem to find a way to do this other than the dodge chart, and it's not really working (I'm a newbie).
By the way, the error points at this line:
p9.vbar(x=dodge(df_nfl_1999['Tm'],0.0,range=p9.x_range),top=df_nfl_1999['PF'],legend='PF in 1999', width=0.3)
I could use a scatter plot, for example, and both charts would coexist, and in some cases overlap (if the data is the same), but I was really aiming at plotting it side by side. The other answers related to the subject usually have older versions of Bokeh with deprecated functions.
Any way I can solve this? Thanks!
Edit:
Here is the .head() method. The other one will return exactly the same categories, columns and rows, except that obviously the data changes since it's from a different season.
Tm W L W-L% PF PA PD MoV SoS SRS OSRS DSRS
0 Washington Redskins 10 6 0.625 443 377 66 4.1 -1.3 2.9 6.8 -3.9
1 Dallas Cowboys 8 8 0.500 352 276 76 4.8 -1.6 3.1 -0.3 3.4
2 New York Giants 7 9 0.438 299 358 -59 -3.7 0.7 -3.0 -1.8 -1.2
3 Arizona Cardinals 6 10 0.375 245 382 -137 -8.6 -0.2 -8.8 -5.5 -3.2
4 Philadelphia Eagles 5 11 0.313 272 357 -85 -5.3 1.1 -4.2 -3.3 -0.9
Executing just x=dodge returns:
dodge() missing 1 required positional argument: 'value'
Adding the argument value=0.0 or value=0.2 gives the same error as in the original post.
The first argument to dodge should be the name of a single column in a ColumnDataSource. The effect is that any values from that column are dodged by the specified amount when used as coordinates.
You are trying to pass the contents of a column, which is not expected. It's hard to say for sure without complete code to test, but you most likely want
x=dodge('Tm', ...)
However, you will also need to use an explicit Bokeh ColumnDataSource and pass it as source to vbar, as is done in the example you linked. You can construct one explicitly, but often you can also just pass the dataframe directly with source=df, and it will be adapted.
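A minimal sketch of how those pieces could fit together, with some assumptions: both dataframes list the teams in the same order, the PF columns from the two seasons are combined into one source, and legend_label is used (Bokeh 1.4+; older releases used legend):
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.transform import dodge

# Combine both seasons into one source so each vbar can reference a column name
source = ColumnDataSource(data=dict(
    Tm=list(df_nfl_1999['Tm']),
    PF_1999=list(df_nfl_1999['PF']),
    PF_2018=list(df_nfl['PF']),
))

p9 = figure(title='Comparison 1999 x 2018', background_fill_color='#efefef',
            x_range=list(df_nfl_1999['Tm']))
p9.vbar(x=dodge('Tm', -0.15, range=p9.x_range), top='PF_1999', source=source,
        width=0.3, legend_label='PF in 1999')
p9.vbar(x=dodge('Tm', 0.15, range=p9.x_range), top='PF_2018', source=source,
        width=0.3, color='#A6CEE3', legend_label='PF in 2018')
show(p9)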

sort pyspark dataframe within groups

I would like to sort column "time" within each "id" group.
The data looks like:
id time name
132 12 Lucy
132 10 John
132 15 Sam
78 11 Kate
78 7 Julia
78 2 Vivien
245 22 Tom
I would like to get this:
id time name
132 10 John
132 12 Lucy
132 15 Sam
78 2 Vivien
78 7 Julia
78 11 Kate
245 22 Tom
I tried
df.orderBy(['id', 'time'])
But I don't need to sort "id".
I have two questions:
Can I sort just "time" within the same "id"? And how?
Will it be more efficient to sort just "time" than to use orderBy() on both columns?
This is exactly what windowing is for.
You can create a window partitioned by the "id" column and sorted by the "time" column. Next you can apply any function on that window.
# Create a Window
from pyspark.sql.window import Window
w = Window.partitionBy(df.id).orderBy(df.time)
Now use this window over any function:
For example, let's say you want to create a column with the time delta between consecutive rows within the same group:
import pyspark.sql.functions as f
df = df.withColumn("timeDelta", df.time - f.lag(df.time,1).over(w))
I hope this gives you an idea. Effectively you have sorted your dataframe using the window and can now apply any function to it.
If you just want to view your result, you could find the row number and sort by that as well.
df.withColumn("order", f.row_number().over(w)).sort("order").show()
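Putting it together as a self-contained sketch (the data and column names are taken from the question; sorting by "id" as well is only to make the grouped display match the desired output):
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(132, 12, "Lucy"), (132, 10, "John"), (132, 15, "Sam"),
     (78, 11, "Kate"), (78, 7, "Julia"), (78, 2, "Vivien"),
     (245, 22, "Tom")],
    ["id", "time", "name"])

# One partition per id, rows ordered by time within it
w = Window.partitionBy("id").orderBy("time")

df.withColumn("order", f.row_number().over(w)).sort("id", "order").show()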

Modifying A SAS Data set after it was created using PROC IMPORT

I have a dataset like this
Obs MinNo EurNo MinLav EurLav
1 103 15.9 92 21.9
2 68 18.5 126 18.5
3 79 15.9 114 22.3
My goal is to create a data set like this from the dataset above:
Obs Min Eur Lav
1 103 15.9 No
2 92 21.9 Yes
3 68 18.5 No
4 126 18.5 Yes
5 79 15.9 No
6 114 22.3 Yes
Basically, I'm taking the 4 columns and appending them into 2 columns, plus a categorical column indicating which pair of columns each row came from.
Here's what I have so far
PROC IMPORT DATAFILE='f:\data\order_effect.xls' DBMS=XLS OUT=orderEffect;
RUN;
DATA temp;
INFILE orderEffect;
INPUT minutes euros ##;
IF MOD(_N_,2)^=0 THEN lav='Yes';
ELSE lav='No';
RUN;
My question, though, is: how can I import an Excel sheet and then modify the SAS data set it creates, so that I can move the second two columns below the first two and add a third column based on which columns they came from?
I know how to do this by splitting the data set into two data sets and then appending one onto the other, but doing it with the MOD function above would be a lot faster.
You were very close, but misunderstood what PROC IMPORT does.
When PROC IMPORT completes, it will have created a SAS data set named orderEffect containing SAS variables from the columns in your worksheet. You just need a short data step program to produce the result you want. Try this:
data want;
/* Define the SAS variables you want to keep */
format Min 8. Eur 8.1;
length Lav $3;
keep Min Eur Lav;
set orderEffect;
Min = MinNo;
Eur = EurNo;
Lav = 'No';
output;
Min = MinLav;
Eur = EurLav;
Lav = 'Yes';
output;
run;
This assumes that the PROC IMPORT step created a data set with those names. Run that step first to be sure and revise the program if necessary.
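As an aside, and purely for comparison (this is not part of the SAS answer): the same wide-to-long reshape sketched in pandas, using the column names from the question:
import pandas as pd

df = pd.DataFrame({'MinNo': [103, 68, 79], 'EurNo': [15.9, 18.5, 15.9],
                   'MinLav': [92, 126, 114], 'EurLav': [21.9, 18.5, 22.3]})

no = df[['MinNo', 'EurNo']].rename(columns={'MinNo': 'Min', 'EurNo': 'Eur'}).assign(Lav='No')
yes = df[['MinLav', 'EurLav']].rename(columns={'MinLav': 'Min', 'EurLav': 'Eur'}).assign(Lav='Yes')

# Interleave the 'No' and 'Yes' rows so each observation's pair stays together
want = pd.concat([no, yes]).sort_index(kind='stable').reset_index(drop=True)
print(want)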
