Moving aggregate within a specified date range - python-3.x

Using the sample credit card transaction data below:
import random
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to get, for each transaction, the total amount spent on that card in the past 7 days as of the point of the transaction. For example, if card_id 1 made a transaction on June 8, I want to get the total of its transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), 'amount'].sum()

df['sum_past_7d'] = df.apply(myfunction, axis=1)
Is there a faster and more efficient way to do this?

Let's try rolling on date with groupby; rolling('7D') includes the current row, so a groupby shift is used afterwards to exclude the current transaction:
# make sure the data is sorted properly
# (your sample is already sorted, so you can skip this)
df = df.sort_values(['card_id', 'date'])

df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values
                    )
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
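On small data you can sanity-check the result (a sketch of mine, not part of the answer) by recomputing the column with the original row-wise logic and comparing:
def past_week_sum(row):
    # total spent on this card in the half-open window [date - 7 days, date)
    mask = ((df['card_id'] == row['card_id'])
            & (df['date'] >= row['date'] - pd.Timedelta(days=7))
            & (df['date'] < row['date']))
    return df.loc[mask, 'amount'].sum()

check = df.apply(past_week_sum, axis=1)
assert (check == df['sum_past_7D']).all()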

Related

Analysis on dataframe with python

I want to be able to calculate the average 'goal', 'shot', and 'miss' per shooterName to use for further analysis and visualization.
The code below gives me the count of the three attributes (shot, goal, miss) in the 'event' column, grouped by 'shooterName'.
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' attribute as a percentage of the total events per 'shooterName'. Below I added a column '%totalshooterNameevents', which is simply each of 'goal', 'shot', and 'miss' divided by the total of goal, shot, and miss for that 'shooterName'.
Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].apply(lambda x: x / sum(x))
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
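For what it's worth, a groupby.transform variant (a suggestion, not part of the original answer) avoids the lambda and is usually faster:
dft['%total'] = dft['count'] / dft.groupby('shooterName')['count'].transform('sum')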
Without a sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np

# Set up a minimal reproducible example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
                   'event': np.random.choice(['shot', 'goal', 'miss'], 20)})

# Create an empty dataframe indexed by the unique shooters
dft = pd.DataFrame(index=df['shooterName'].unique())

# Count events per shooter, then join each event's share of that total
grp = df.groupby('shooterName')
dft['count'] = grp['event'].count()
dft = dft.join(grp['event'].value_counts().unstack('event')
                  .div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000
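If only the shares are needed, pd.crosstab with normalize='index' is a compact alternative (a sketch, not from the original answer); each row then sums to 1:
shares = pd.crosstab(df['shooterName'], df['event'], normalize='index')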

How to sum by month in timestamp Data Frame?

I have a dataframe like this:
trx_date     trx_amount
2013-02-11   35
2014-03-10   26
2011-02-9    10
2013-02-12   5
2013-01-11   21
How do I group that by month and year, so that I can sum the trx_amount?
example expected output :
trx_monthly
trx_sum
2013-02
40
2013-01
21
2014-02
35
You can convert values to month periods by Series.dt.to_period and then aggregate sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])

df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert datetimes to strings in format YYYY-MM by Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert to separate year and month columns; then the output is different - 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
                   df['trx_date'].dt.month.rename('month')])['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
You can try this -
df['trx_month'] = df['trx_date'].dt.strftime('%Y-%m')
df_agg = df.groupby('trx_month')['trx_amount'].sum()
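For completeness, a pd.Grouper-based sketch (not from the answers above) produces the same monthly sums, though it also emits zero rows for any empty months within the date span:
df_monthly = df.groupby(pd.Grouper(key='trx_date', freq='M'))['trx_amount'].sum()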

Grouping data based on month-year in pandas and then dropping all entries except the latest one - Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by a particular month-year and then keeps only the entry with the latest date in that month-year, dropping the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I have not been able to write proper code that groups the data by month-year and indicator and gets the correct results.
Use Series.dt.to_period for month periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass the result to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
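An equivalent route (a sketch, not from the original answer) sorts by date and keeps the last row per month period with drop_duplicates:
out = (df.assign(month=df['Date'].dt.to_period('m'))
         .sort_values('Date')
         .drop_duplicates('month', keep='last')
         .drop(columns='month'))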

Pandas : How to iterate over only back dates over two ID columns to give unique IDs and count?

I have a dataset with data spanning over a month. I need to count the players who have played on a previous day. So if I am looking at the users on 5th June, I need to find the number of those users who appeared at least once on any date preceding 5th June.
The dataset is something like this:
Day pid1 pid2
1 1a 1b
1 1c 2e
1 1d 2w
1 1e 2q
2 1f 4r
2 1g 5t
2 2e 7u
2 2w 8i
2 2q 9o
3 4r 0yu
3 5t 5t
3 6t 1w
4 1a 2e
4 1f 9o
4 7u 6h
5 8i 4f
5 9o 3d
5 0yu 5g
5 5t 6h
I have tried iterating over days, then over pid1 and pid2, but to no avail; it is also computationally expensive, as I have over 5 million data points.
I really do not know how to approach this, and the only thing I have tried is this:
for x in range(1, 31):
    for i in ids.iterrows():
        if i['Ids'] == zip(df4['pid1'], df['pid2']):
            print(x, i.count())
But it still doesn't let me iterate over only previous days and not next days.
I need an answer that looks something like this (the numbers shown are not accurate); what I need is the unique count of users from previous days appearing on a given day:
Day Previous day users
1 0
2 2
3 2
4 5
5 5
As I understand it, you want to count the player IDs that have appeared on days before the given Day. You can try the below:
# reshape to one row per (Day, player) and drop duplicates within a day
m = (df.melt('Day').sort_values('Day').drop_duplicates(['Day', 'value'])
       .reset_index(drop=True).drop(columns='variable'))
# cumcount counts how many earlier days each player has already appeared on
m.assign(k=m.groupby('value').cumcount()).groupby('Day')['k'].sum()  # assign it back
Day
1 0
2 3
3 2
4 6
5 7
If cumulative counts are not required, and instead each player should count at most once (i.e. has the player appeared on any earlier day), use:
m.assign(k=m.groupby('value').cumcount().ne(0)).groupby('Day')['k'].sum() #.astype(int)
Day
1 0
2 3
3 2
4 5
5 5
Edit: After the OP's comment, I am providing both answers:
Solution for checking only one day before:
Instead of using two for loops and an if statement, I used pandas operations to increase computational speed.
df.head()
Day pid1 pid2
0 1 1a 1b
1 1 1c 2e
2 1 1d 2w
3 1 1e 2q
4 2 1f 4r
Then group by Day to collect the players into lists:
tmp = df.groupby("Day").agg(list)
tmp
Day pid1 pid2
1 [1a, 1c, 1d, 1e] [1b, 2e, 2w, 2q]
2 [1f, 1g, 2e, 2w, 2q] [4r, 5t, 7u, 8i, 9o]
3 [4r, 5t, 6t] [0yu, 5t, 1w]
4 [1a, 1f, 7u] [2e, 9o, 6h]
5 [8i, 9o, 0yu, 5t] [4f, 3d, 5g, 6h]
Then concatenate the i-th day's players with the (i-1)-th day's players:
tmp2 = pd.DataFrame(tmp["pid1"] + tmp["pid2"], columns = ["current_day"])
tmp2["previous_day"] = tmp2.shift()
tmp2 = tmp2.fillna("nan")
tmp2
Day current_day previous_day
1 [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q] nan
2 [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o] [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q]
3 [4r, 5t, 6t, 0yu, 5t, 1w] [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o]
4 [1a, 1f, 7u, 2e, 9o, 6h] [4r, 5t, 6t, 0yu, 5t, 1w]
5 [8i, 9o, 0yu, 5t, 4f, 3d, 5g, 6h] [1a, 1f, 7u, 2e, 9o, 6h]
And finally, take the length of the intersection, i.e. the number of players who played on both the current day and the previous day:
tmp2.apply(lambda x: len(list(set(x["current_day"]) & set(x["previous_day"]))), axis = 1)
Day
1 0
2 3
3 2
4 0
5 2
dtype: int64
Solution for checking all previous days:
res = pd.DataFrame()
for day_num in df["Day"].unique():
    tmp = df[df["Day"] == day_num]
    tmp2 = pd.DataFrame(pd.concat([tmp["pid1"], tmp["pid2"]]).unique(), columns=["players"])
    tmp2["day"] = day_num
    res = pd.concat([res, tmp2])
res = res.reset_index(drop=True)
This combines all pid1 and pid2 values into a single players column:
res.head()
players day
0 1a 1
1 1c 1
2 1d 1
3 1e 1
4 1b 1
Then, for each day, count the current players who appeared on any previous day:
result = []
for day_num in df["Day"].unique():
    current_players = pd.Series(res[res["day"] == day_num].players.unique())
    previous_players = pd.Series(res[res["day"] < day_num].players.unique())
    result.append(len(current_players[current_players.isin(previous_players)]))
result = pd.Series(result, index=df["Day"].unique())
result
The result in pd.Series format:
1 0
2 3
3 2
4 5
5 5
dtype: int64
Hope it works!
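For reference, since the data has over 5 million rows, a plain-Python running set (a sketch of mine, not from the answers above) computes the all-previous-days counts in a single pass:
seen = set()
counts = {}
for day, grp in df.groupby('Day', sort=True):
    today = set(grp['pid1']) | set(grp['pid2'])   # unique players active today
    counts[day] = len(today & seen)               # how many were seen on earlier days
    seen |= today                                 # remember today's players
result = pd.Series(counts)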

pandas calculate scores for each group based on multiple functions

I have the following df,
group_id code amount date
1 100 20 2017-10-01
1 100 25 2017-10-02
1 100 40 2017-10-03
1 100 25 2017-10-03
2 101 5 2017-11-01
2 102 15 2017-10-15
2 103 20 2017-11-05
I'd like to groupby group_id and then compute scores for each group based on the following features:
if code values are all the same in a group, score 0 and 10 otherwise;
if amount sum is > 100, score 20 and 0 otherwise;
sort_values by date in descending order and sum the differences between the dates; if the sum is < 5 days, score 30, otherwise 0.
so the result df looks like,
group_id code amount date score
1 100 20 2017-10-01 50
1 100 25 2017-10-02 50
1 100 40 2017-10-03 50
1 100 25 2017-10-03 50
2 101 5 2017-11-01 10
2 102 15 2017-10-15 10
2 103 20 2017-11-05 10
here are the functions that correspond to each feature above:
import numpy as np

def amount_score(df, amount_col, thold=100):
    if df[amount_col].sum() > thold:
        return 20
    else:
        return 0

def col_uniq_score(df, col_name):
    if df[col_name].nunique() == 1:
        return 0
    else:
        return 10

def date_diff_score(df, col_name):
    df.sort_values(by=[col_name], ascending=False, inplace=True)
    if df[col_name].diff().dropna().sum() / np.timedelta64(1, 'D') < 5:
        return 30
    else:
        return 0
I am wondering how to apply these functions to each group and calculate the sum of all the functions to give a score.
You can try groupby.transform, which returns a Series the same size as the original DataFrame, combined with numpy.where as an if-else for Series:
df_sorted = df.sort_values('date', ascending=False)
grouped = df_sorted.groupby('group_id', sort=False)

a = np.where(grouped['code'].transform('nunique') == 1, 0, 10)
print (a)
[10 10 10  0  0  0  0]

b = np.where(grouped['amount'].transform('sum') > 100, 20, 0)
print (b)
[ 0  0  0 20 20 20 20]

c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
print (c)
[30 30 30 30 30 30 30]

# a, b and c follow df_sorted's row order, so align back to df by index
df['score'] = pd.Series(a + b + c, index=df_sorted.index)
print (df)
   group_id  code  amount        date  score
0         1   100      20  2017-10-01     50
1         1   100      25  2017-10-02     50
2         1   100      40  2017-10-03     50
3         1   100      25  2017-10-03     50
4         2   101       5  2017-11-01     40
5         2   102      15  2017-10-15     40
6         2   103      20  2017-11-05     40
Note that because the dates are sorted in descending order, the summed differences are negative and therefore always below 5 days, so every group gets the 30-point date score; that is why group 2 scores 40 here rather than the 10 shown in the desired output.
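For comparison, a slower but readable sketch (my addition, not from the answer) reuses the question's three functions via groupby.apply, assuming the corrected versions above and a datetime 'date' column:
df['date'] = pd.to_datetime(df['date'])

def total_score(g):
    # pass a copy because date_diff_score sorts inplace
    return (col_uniq_score(g, 'code')
            + amount_score(g, 'amount')
            + date_diff_score(g.copy(), 'date'))

df['score'] = df['group_id'].map(df.groupby('group_id').apply(total_score))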
