Expanding sum with group by date - python-3.x

I have a dataframe where I'm trying to do an expanding sum of values and group them by date.
Specifically, my data looks like:
creationDateTime OK Fail
2017-01-06 21:30:00 4 0
2017-01-06 21:35:00 4 0
2017-01-06 21:36:00 4 0
2017-01-07 21:48:00 3 1
2017-01-07 21:53:00 4 0
2017-01-08 21:22:00 3 1
2017-01-08 21:27:00 3 1
2017-01-09 21:49:00 3 1
and I'm trying to get something similar to:
creationDateTime OK Fail RollingOK RollingFail
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
I've figured out how to do a rolling sum of the values by using:
data_aggregated['RollingOK'] = data_aggregated['OK'].expanding(0).sum()
data_aggregated['RollingFail'] = data_aggregated['Fail'].expanding(0).sum()
But I'm not sure how I can alter this to get the rolling sums grouped by day, since the code above does a rolling sum over all the rows, without grouping by day.
Any help is very much appreciated.

Use DataFrameGroupBy.cumsum with specified columns after groupby:
#if DatetimeIndex
idx = data_aggregated.index.date
#if column
#idx = data_aggregated['creationDateTime'].dt.date
data_aggregated[['RollingOK','RollingFail']] = (data_aggregated.groupby(idx)[['OK','Fail']]
.cumsum())
print (data_aggregated)
OK Fail RollingOK RollingFail
creationDateTime
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
You can also work with all columns:
data_aggregated = (data_aggregated.join(data_aggregated.groupby(idx)
.cumsum()
.add_prefix('Rolling')))
print (data_aggregated)
OK Fail RollingOK RollingFail
creationDateTime
2017-01-06 21:30:00 4 0 4 0
2017-01-06 21:35:00 4 0 8 0
2017-01-06 21:36:00 4 0 12 0
2017-01-07 21:48:00 3 1 3 1
2017-01-07 21:53:00 4 0 7 1
2017-01-08 21:22:00 3 1 3 1
2017-01-08 21:27:00 3 1 6 2
2017-01-09 21:49:00 3 1 3 1
Your expanding-based solution can be adapted by grouping first and then dropping the extra group level from the result:
data_aggregated[['RollingOK','RollingFail']] = (data_aggregated.groupby(idx)[['OK','Fail']]
.expanding(0)
.sum()
.reset_index(level=0, drop=True))
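As a rough check of the two approaches above, here is a minimal, self-contained sketch that rebuilds the sample data from the question (assuming creationDateTime is the index) and computes the per-day running totals with both cumsum and expanding().sum(). The variable names by_cumsum and by_expanding are only illustrative.

```python
import pandas as pd

# Sample data from the question, indexed by timestamp.
data_aggregated = pd.DataFrame(
    {"OK": [4, 4, 4, 3, 4, 3, 3, 3], "Fail": [0, 0, 0, 1, 0, 1, 1, 1]},
    index=pd.to_datetime([
        "2017-01-06 21:30:00", "2017-01-06 21:35:00", "2017-01-06 21:36:00",
        "2017-01-07 21:48:00", "2017-01-07 21:53:00",
        "2017-01-08 21:22:00", "2017-01-08 21:27:00",
        "2017-01-09 21:49:00",
    ]).rename("creationDateTime"),
)

# Group key: the calendar day of each timestamp.
idx = data_aggregated.index.date

# Running total that restarts at the first row of every day.
by_cumsum = data_aggregated.groupby(idx)[["OK", "Fail"]].cumsum()

# Same totals via an expanding window per group; the group key is added
# as an extra index level, so it is dropped again afterwards.
by_expanding = (
    data_aggregated.groupby(idx)[["OK", "Fail"]]
    .expanding()
    .sum()
    .reset_index(level=0, drop=True)
)

print(by_cumsum)
print(by_expanding)
```

Both frames hold the same numbers; the expanding version returns floats, while cumsum keeps the integer dtype.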

You can use the following (if creationDateTime is a regular column rather than the index):
df['RollingOK']=df.groupby(df.creationDateTime.dt.date)['OK'].cumsum()
df['RollingFail']=df.groupby(df.creationDateTime.dt.date)['Fail'].cumsum()
print(df)
creationDateTime OK Fail RollingOK RollingFail
0 2017-01-06 21:30:00 4 0 4 0
1 2017-01-06 21:35:00 4 0 8 0
2 2017-01-06 21:36:00 4 0 12 0
3 2017-01-07 21:48:00 3 1 3 1
4 2017-01-07 21:53:00 4 0 7 1
5 2017-01-08 21:22:00 3 1 3 1
6 2017-01-08 21:27:00 3 1 6 2
7 2017-01-09 21:49:00 3 1 3 1

Related

New column based on values in row and a fixed column value in Pandas Dataframe

I have a dataframe that looks like
Date col_1 col_2 col_3
2022-08-20 5 B 1
2022-07-21 6 A 1
2022-07-20 2 A 1
2022-06-15 5 B 1
2022-06-11 3 C 1
2022-06-05 5 C 2
2022-06-01 3 B 2
2022-05-21 6 A 1
2022-05-13 6 A 0
2022-05-10 2 B 3
2022-04-11 2 C 3
2022-03-16 5 A 3
2022-02-20 5 B 1
and I want to add a new column col_new that holds a cumulative count of the rows with the same values in col_1 and col_2, excluding the row itself, counting only rows where col_3 is 1. The desired output would look like
Date col_1 col_2 col_3 col_new
2022-08-20 5 B 1 3
2022-07-21 6 A 1 2
2022-07-20 2 A 1 1
2022-06-15 5 B 1 2
2022-06-11 3 C 1 1
2022-06-05 5 C 2 0
2022-06-01 3 B 2 0
2022-05-21 6 A 1 1
2022-05-13 6 A 0 0
2022-05-10 2 B 3 0
2022-04-11 2 C 3 0
2022-03-16 5 A 3 0
2022-02-20 5 B 1 1
And here's what I have tried:
Date = pd.to_datetime(df['Date'], dayfirst=True)
list_col_3_is_1 = (df
.assign(Date=Date)
.sort_values('Date', ascending=True)
['col_3'].eq(1))
df['col_new'] = (list_col_3_is_1.groupby(df[['col_1','col_2']]).apply(lambda g: g.shift(1, fill_value=0).cumsum()))
But then I got the following error: ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Thanks in advance.
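The error occurs because groupby received a DataFrame as the grouper. Change your solution to pass the grouping columns as a list of Series: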
df['col_new'] = list_col_3_is_1.groupby([df['col_1'],df['col_2']]).cumsum()
print (df)
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
Assuming you already have the rows sorted in the desired order, you can use:
df['col_new'] = (df[::-1].assign(n=df['col_3'].eq(1))
.groupby(['col_1', 'col_2'])['n'].cumsum()
)
Output:
Date col_1 col_2 col_3 col_new
0 2022-08-20 5 B 1 3
1 2022-07-21 6 A 1 2
2 2022-07-20 2 A 1 1
3 2022-06-15 5 B 1 2
4 2022-06-11 3 C 1 1
5 2022-06-05 5 C 2 0
6 2022-06-01 3 B 2 0
7 2022-05-21 6 A 1 1
8 2022-05-13 6 A 0 0
9 2022-05-10 2 B 3 0
10 2022-04-11 2 C 3 0
11 2022-03-16 5 A 3 0
12 2022-02-20 5 B 1 1
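For reference, a self-contained sketch of the grouped cumulative count on the question's data. It sorts by Date explicitly instead of assuming the rows are already ordered, and it relies on index alignment when assigning the result back. The names ordered, flag and col_new_exclusive are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2022-08-20", "2022-07-21", "2022-07-20", "2022-06-15", "2022-06-11",
        "2022-06-05", "2022-06-01", "2022-05-21", "2022-05-13", "2022-05-10",
        "2022-04-11", "2022-03-16", "2022-02-20",
    ]),
    "col_1": [5, 6, 2, 5, 3, 5, 3, 6, 6, 2, 2, 5, 5],
    "col_2": list("BAABCCBAABCAB"),
    "col_3": [1, 1, 1, 1, 1, 2, 2, 1, 0, 3, 3, 3, 1],
})

# Walk through the rows from oldest to newest.
ordered = df.sort_values("Date")

# 1 where col_3 equals 1, otherwise 0.
flag = ordered["col_3"].eq(1).astype(int)

# Running count per (col_1, col_2) pair; assignment aligns on the index,
# so the original row order of df is preserved.
df["col_new"] = flag.groupby([ordered["col_1"], ordered["col_2"]]).cumsum()

# Variant that does not count the current row itself.
df["col_new_exclusive"] = df["col_new"] - df["col_3"].eq(1).astype(int)

print(df)
```

The col_new column matches the desired output in the question, which counts the current row; col_new_exclusive is an alternative if the current row really should be left out.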

How to repeat rows in dataframe with each values in a list?

I have a dataframe as follows
df = pd.DataFrame({
'DATE' : ['2015-12-01', '2015-12-01', '2015-12-02', '2015-12-02'],
'DAY_NUMBER' : [3, 3, 4, 4],
'HOUR' : [5, 6, 5, 6],
'count' : [12,11,14,15]
})
DATE DAY_NUMBER HOUR count
0 2015-12-01 3 5 12
1 2015-12-01 3 6 11
2 2015-12-02 4 5 14
3 2015-12-02 4 6 15
And I have a list extra_hours = [1,13]
I would like to create new rows in which the HOUR column is filled from extra_hours and count=0, and repeat this row creation for each unique ['DATE', 'DAY_NUMBER'] pair.
My Expected df is as follows.
DATE DAY_NUMBER HOUR count
0 2015-12-01 3 5 12.0
1 2015-12-01 3 6 11.0
2 2015-12-02 4 5 14.0
3 2015-12-02 4 6 15.0
0 2015-12-01 3 1 0.0
0 2015-12-01 3 13 0.0
2 2015-12-02 4 1 0.0
2 2015-12-02 4 13 0.0
Now I am creating the dataframe using the below code. I searched a lot, but couldn't find any easier solution. Any help is appreciated to improve the code and performance.
extra_df = df[['DATE', 'DAY_NUMBER']].sort_values('DATE').drop_duplicates()
extra_df['HOUR'] = np.array(extra_hours).reshape(1,len(extra_hours)).repeat(extra_df.shape[0], axis=0).tolist()
df.append(extra_df.explode('HOUR'), sort=False).fillna(0)
Use a cross join via DataFrame.merge with a helper DataFrame built from the extra_hours list, then append the result to the original with pd.concat (DataFrame.append was removed in pandas 2.0):
extra_hours = [1,13]
extra_df = df[['DATE', 'DAY_NUMBER']].sort_values('DATE').drop_duplicates()
extra_df1 = pd.DataFrame({'HOUR':extra_hours, 'count':0, 'tmp':1})
df1 = extra_df.assign(tmp=1).merge(extra_df1, on='tmp').drop(columns='tmp')
extra_df = pd.concat([df, df1], sort=True, ignore_index=True)
print (extra_df)
DATE DAY_NUMBER HOUR count
0 2015-12-01 3 5 12
1 2015-12-01 3 6 11
2 2015-12-02 4 5 14
3 2015-12-02 4 6 15
4 2015-12-01 3 1 0
5 2015-12-01 3 13 0
6 2015-12-02 4 1 0
7 2015-12-02 4 13 0
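On pandas 1.2 or later the helper tmp column is not needed, since merge accepts how="cross". The following is a minimal sketch under that assumption; the names keys, extra, new_rows and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["2015-12-01", "2015-12-01", "2015-12-02", "2015-12-02"],
    "DAY_NUMBER": [3, 3, 4, 4],
    "HOUR": [5, 6, 5, 6],
    "count": [12, 11, 14, 15],
})
extra_hours = [1, 13]

# One row per unique (DATE, DAY_NUMBER) pair.
keys = df[["DATE", "DAY_NUMBER"]].drop_duplicates()

# One row per extra hour, each with a count of zero.
extra = pd.DataFrame({"HOUR": extra_hours, "count": 0})

# Cross join: every key pair is combined with every extra hour.
new_rows = keys.merge(extra, how="cross")

# Stack the new rows under the original ones.
result = pd.concat([df, new_rows], ignore_index=True)
print(result)
```

Because both frames already share the same columns and integer dtypes, no fillna step is needed and count stays an integer column.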
Here is another approach using pd.MultiIndex.from_product():
extra_hours = [1,13]
uniq_dates=df['DATE'].unique()
extra_df = pd.DataFrame({'HOUR':extra_hours, 'count':0})
df1=pd.DataFrame.from_records(pd.MultiIndex.from_product([uniq_dates,extra_df.index]),
columns=['DATE','index']).set_index('index').assign(**extra_df)
final=pd.concat([df,df1],ignore_index=True,sort=False)
final['DAY_NUMBER']=final['DATE'].map(
df[['DATE', 'DAY_NUMBER']].drop_duplicates().set_index(['DATE'])['DAY_NUMBER'])
print(final)
DATE DAY_NUMBER HOUR count
0 2015-12-01 3 5 12
1 2015-12-01 3 6 11
2 2015-12-02 4 5 14
3 2015-12-02 4 6 15
4 2015-12-01 3 1 0
5 2015-12-01 3 13 0
6 2015-12-02 4 1 0
7 2015-12-02 4 13 0

KeyError: 'Dayofweek'

It seems the fastai library is not working in my Python setup. I have tried to add a feature with the following lines of code; it should identify whether a given day is Monday/Friday or Tuesday/Wednesday/Thursday.
The code is as follows:
data['mon_fri'] = 0
for i in range(0,len(data)):
    if (data['Dayofweek'][i] == 0 or data['Dayofweek'][i] == 4):
        data['mon_fri'][i] = 1
    else:
        data['mon_fri'][i] = 0
When I run it, I get the following error:
KeyError: 'Dayofweek'
Can anyone help me on this?
New answer showing the entire dataframe.
In [51]: df = pd.DataFrame({'col1':range(9)})
In [52]: df['d'] = pd.date_range('2016-12-31','2017-01-08',freq='D')
In [53]: df
Out[53]:
col1 d
0 0 2016-12-31
1 1 2017-01-01
2 2 2017-01-02
3 3 2017-01-03
4 4 2017-01-04
5 5 2017-01-05
6 6 2017-01-06
7 7 2017-01-07
8 8 2017-01-08
Now adding column for day of the week
In [54]: df['dow'] = df['d'].dt.dayofweek
In [55]: df
Out[55]:
col1 d dow
0 0 2016-12-31 5
1 1 2017-01-01 6
2 2 2017-01-02 0
3 3 2017-01-03 1
4 4 2017-01-04 2
5 5 2017-01-05 3
6 6 2017-01-06 4
7 7 2017-01-07 5
8 8 2017-01-08 6
Finally, do the calculation: 1 for Monday/Friday (dayofweek 0 or 4), 0 for other days
In [56]: df['feature'] = df['dow'].apply(lambda x: int((x==0) or (x==4)))
In [57]: df
Out[57]:
col1 d dow feature
0 0 2016-12-31 5 0
1 1 2017-01-01 6 0
2 2 2017-01-02 0 1
3 3 2017-01-03 1 0
4 4 2017-01-04 2 0
5 5 2017-01-05 3 0
6 6 2017-01-06 4 1
7 7 2017-01-07 5 0
8 8 2017-01-08 6 0
Assuming you are using pandas, you can just use the built-in dt.dayofweek accessor (Monday is 0, Sunday is 6):
In [32]: d = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
In [33]: d
Out[33]:
2016-12-31 2016-12-31
2017-01-01 2017-01-01
2017-01-02 2017-01-02
2017-01-03 2017-01-03
2017-01-04 2017-01-04
2017-01-05 2017-01-05
2017-01-06 2017-01-06
2017-01-07 2017-01-07
2017-01-08 2017-01-08
Freq: D, dtype: datetime64[ns]
In [34]: s = (d.dt.dayofweek==0) |(d.dt.dayofweek==4)
In [35]: s
Out[35]:
2016-12-31 False
2017-01-01 False
2017-01-02 True
2017-01-03 False
2017-01-04 False
2017-01-05 False
2017-01-06 True
2017-01-07 False
2017-01-08 False
Freq: D, dtype: bool
Then convert to 1/0 simply by
In [39]: t = s.apply(lambda x: int(x==True))
In [40]: t
Out[40]:
2016-12-31 0
2017-01-01 0
2017-01-02 1
2017-01-03 0
2017-01-04 0
2017-01-05 0
2017-01-06 1
2017-01-07 0
2017-01-08 0
Freq: D, dtype: int64
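The KeyError itself just means that data has no column named exactly 'Dayofweek'. A minimal sketch of one way around it, assuming the frame has a datetime column (called Date here purely for illustration): derive the column with pandas and build the flag without a Python loop.

```python
import pandas as pd

# Hypothetical stand-in for the asker's frame; "Date" is an assumed column name.
data = pd.DataFrame({"Date": pd.date_range("2016-12-31", "2017-01-08", freq="D")})

# Create the missing column. Series.dt.dayofweek numbers Monday as 0
# and Sunday as 6.
data["Dayofweek"] = data["Date"].dt.dayofweek

# 1 for Monday (0) or Friday (4), 0 for every other day.
data["mon_fri"] = data["Dayofweek"].isin([0, 4]).astype(int)

print(data)
```

Using isin on the whole column also avoids the chained assignment data['mon_fri'][i] = ... from the question, which pandas warns about and which may not modify the frame.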

Dataframe concatenate columns

I have a dataframe with a multiindex (ID, Date, LID) and columns from 0 to N that looks something like this:
0 1 2 3 4
ID Date LID
00112 11-02-2014 I 0 1 5 6 7
00112 11-02-2014 II 2 4 5 3 4
00112 30-07-2015 I 5 7 1 1 2
00112 30-07-2015 II 3 2 8 7 1
I would like to group the dataframe by ID and Date and concatenate the columns to the same row such that it looks like this:
0 1 2 3 4 5 6 7 8 9
ID Date
00112 11-02-2014 0 1 5 6 7 2 4 5 3 4
00112 30-07-2015 5 7 1 1 2 3 2 8 7 1
Using pd.concat and pd.DataFrame.xs
pd.concat(
[df.xs(x, level=2) for x in df.index.levels[2]],
axis=1, ignore_index=True
)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
Use unstack + sort_index:
df = df.unstack().sort_index(axis=1, level=1)
#for new columns names
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
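To try the unstack approach end to end, here is a small self-contained sketch that rebuilds the example frame from the question. ID is stored as a string here (an assumption, to keep the leading zeros), and wide is just an illustrative name.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("00112", "11-02-2014", "I"),
        ("00112", "11-02-2014", "II"),
        ("00112", "30-07-2015", "I"),
        ("00112", "30-07-2015", "II"),
    ],
    names=["ID", "Date", "LID"],
)
df = pd.DataFrame(
    [[0, 1, 5, 6, 7], [2, 4, 5, 3, 4], [5, 7, 1, 1, 2], [3, 2, 8, 7, 1]],
    index=index,
)

# Move LID from the row index into the columns, then order the columns
# so that all "I" values come before all "II" values.
wide = df.unstack("LID").sort_index(axis=1, level=1)

# Replace the two-level column labels with a flat 0..N-1 range.
wide.columns = np.arange(wide.shape[1])

print(wide)
```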

How to merge two dataframes with MultiIndex?

I have a frame that looks like:
2015-12-30 2015-12-31
300100 am 1 3
pm 3 2
300200 am 5 1
pm 4 5
300300 am 2 6
pm 3 7
and the other frame looks like
2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 5 1
pm 3 2 4 5
300200 am 2 5 2 6
pm 5 1 3 7
300300 am 1 6 3 2
pm 3 7 2 3
300400 am 3 1 1 3
pm 2 5 5 2
300500 am 1 6 6 1
pm 5 7 7 5
Now I want to merge the two frames so that the result looks like this:
2015-12-30 2015-12-31 2016-1-1 2016-1-2 2016-1-3 2016-1-4
300100 am 1 3 1 3 5 1
pm 3 2 3 2 4 5
300200 am 5 1 2 5 2 6
pm 4 5 5 1 3 7
300300 am 2 6 1 6 3 2
pm 3 7 3 7 2 3
300400 am 3 1 1 3
pm 2 5 5 2
300500 am 1 6 6 1
pm 5 7 7 5
I tried pd.merge(frame1,frame2,right_index=True,left_index=True), but what it returned was not the desired format. Can anyone help? Thanks!
You can use concat:
print (pd.concat([frame1, frame2], axis=1))
2015-12-30 2015-12-31 1.1.2016 2.1.2016 3.1.2016 4.1.2016
300100 am 1.0 3.0 1 3 5 1
pm 3.0 2.0 3 2 4 5
300200 am 5.0 1.0 2 5 2 6
pm 4.0 5.0 5 1 3 7
300300 am 2.0 6.0 1 6 3 2
pm 3.0 7.0 3 7 2 3
300400 am NaN NaN 3 1 1 3
pm NaN NaN 2 5 5 2
300500 am NaN NaN 1 6 6 1
pm NaN NaN 5 7 7 5
Values in the first and second columns are converted to float because the NaN values force int columns to float (see the docs).
One possible solution is to replace NaN with some int, e.g. 0, and then convert to int:
print (pd.concat([frame1, frame2], axis=1)
.fillna(0)
.astype(int))
2015-12-30 2015-12-31 1.1.2016 2.1.2016 3.1.2016 4.1.2016
300100 am 1 3 1 3 5 1
pm 3 2 3 2 4 5
300200 am 5 1 2 5 2 6
pm 4 5 5 1 3 7
300300 am 2 6 1 6 3 2
pm 3 7 3 7 2 3
300400 am 0 0 3 1 1 3
pm 0 0 2 5 5 2
300500 am 0 0 1 6 6 1
pm 0 0 5 7 7 5
You can also use join:
frame1.join(frame2, how='outer')
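A small sketch of the outer join on two MultiIndex frames, using cut-down made-up data in the spirit of the question. It also shows one alternative to fillna(0): converting to the nullable Int64 dtype, which keeps whole numbers and marks missing cells as <NA> instead of replacing them with 0.

```python
import pandas as pd

# Reduced example frames; the values are illustrative only.
idx1 = pd.MultiIndex.from_product([[300100, 300200], ["am", "pm"]])
idx2 = pd.MultiIndex.from_product([[300100, 300200, 300300], ["am", "pm"]])
frame1 = pd.DataFrame({"2015-12-31": [3, 2, 1, 5]}, index=idx1)
frame2 = pd.DataFrame({"2016-1-1": [1, 3, 2, 5, 1, 3]}, index=idx2)

# Outer join on the shared MultiIndex: rows missing from frame1 get NaN,
# which turns that column into float.
merged = frame1.join(frame2, how="outer")

# Nullable integer dtype: integers stay integers, gaps become <NA>.
merged_int = merged.astype("Int64")

print(merged)
print(merged_int)
```

Whether 0 or <NA> is the better placeholder depends on whether a missing cell should later be treated as a real zero.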
