Pandas: How to iterate over only past dates across two ID columns to get unique IDs and a count?

I have a dataset with a little over a month of data. For each day, I need to count the players who have also played on a previous day. So if I am looking at the users on 5th June, I need the number of those users who appeared at least once on any date preceding 5th June.
The dataset looks like this:
Day pid1 pid2
1 1a 1b
1 1c 2e
1 1d 2w
1 1e 2q
2 1f 4r
2 1g 5t
2 2e 7u
2 2w 8i
2 2q 9o
3 4r 0yu
3 5t 5t
3 6t 1w
4 1a 2e
4 1f 9o
4 7u 6h
5 8i 4f
5 9o 3d
5 0yu 5g
5 5t 6h
I have tried iterating over days and then over pid1 and pid2, but to no avail, and it is computationally expensive since I have over 5 million data points.
I really do not know how to approach this; the only thing I have tried is:
for x in range(1, 31):
    for idx, row in ids.iterrows():
        if row['Ids'] == zip(df4['pid1'], df['pid2']):  # comparing a string to a zip object is always False
            print(x, row.count())
But it still doesn't let me iterate over only the previous days rather than the following days.
I need an answer that looks something like this (the numbers below are not accurate); for each day, I need the unique count of that day's users who appeared on previous days:
Day Previous day users
1 0
2 2
3 2
4 5
5 5

As I understand it, you want to count the number of player IDs that appeared on any day before the given Day. You can try the below:
m = (df.melt('Day').sort_values('Day').drop_duplicates(['Day', 'value'])
       .reset_index(drop=True).drop(columns='variable'))
m.assign(k=m.groupby('value').cumcount()).groupby('Day')['k'].sum()  # assign it back if needed
Day
1 0
2 3
3 2
4 6
5 7
If cumulative counts are not required, and instead each player should count at most once per day (1 if they appeared on any earlier day, no matter how many), use:
m.assign(k=m.groupby('value').cumcount().ne(0)).groupby('Day')['k'].sum() #.astype(int)
Day
1 0
2 3
3 2
4 5
5 5
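To see why the two variants differ, here is a minimal, hypothetical illustration (the demo frame is mine, not from the original post): a player seen on two earlier days contributes 2 to the first variant's sum on their third day, but only 1 to the second's.
import pandas as pd

# hypothetical 3-day frame where player 'p' appears every day
demo = pd.DataFrame({'Day': [1, 2, 3], 'value': ['p', 'p', 'p']})
demo.groupby('value').cumcount()        # 0, 1, 2 -> day 3 adds 2 under the first variant
demo.groupby('value').cumcount().ne(0)  # False, True, True -> day 3 adds 1 under the second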

Edit: After the OP's comment, I am providing both answers.
Solution for checking only one day before:
Instead of using two for loops and one if statement, I used pandas operations to improve computational speed.
df.head()
Day pid1 pid2
0 1 1a 1b
1 1 1c 2e
2 1 1d 2w
3 1 1e 2q
4 2 1f 4r
Then group by Day to collect each day's players into lists:
tmp = df.groupby("Day").agg(list)
tmp
Day pid1 pid2
1 [1a, 1c, 1d, 1e] [1b, 2e, 2w, 2q]
2 [1f, 1g, 2e, 2w, 2q] [4r, 5t, 7u, 8i, 9o]
3 [4r, 5t, 6t] [0yu, 5t, 1w]
4 [1a, 1f, 7u] [2e, 9o, 6h]
5 [8i, 9o, 0yu, 5t] [4f, 3d, 5g, 6h]
Then concatenate each day's pid1 and pid2 players and line them up against the previous day's players:
tmp2 = pd.DataFrame(tmp["pid1"] + tmp["pid2"], columns=["current_day"])
tmp2["previous_day"] = tmp2["current_day"].shift()
tmp2 = tmp2.fillna("nan")  # day 1 has no previous day; the placeholder yields an empty intersection below
tmp2
Day current_day previous_day
1 [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q] nan
2 [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o] [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q]
3 [4r, 5t, 6t, 0yu, 5t, 1w] [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o]
4 [1a, 1f, 7u, 2e, 9o, 6h] [4r, 5t, 6t, 0yu, 5t, 1w]
5 [8i, 9o, 0yu, 5t, 4f, 3d, 5g, 6h] [1a, 1f, 7u, 2e, 9o, 6h]
And finally, take the length of each row's intersection, i.e. the number of players who played on both the current day and the previous day:
tmp2.apply(lambda x: len(list(set(x["current_day"]) & set(x["previous_day"]))), axis = 1)
Day
1 0
2 3
3 2
4 0
5 2
dtype: int64
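A compact sketch of the same one-day-before idea (my own shorthand, not from the original answer): melt into a single player column, aggregate each day into a set, and intersect with the shifted sets.
# one row per (Day, player), then one set of players per day
pairs = df.melt('Day', value_name='player')
daily = pairs.groupby('Day')['player'].agg(set)
prev = daily.shift()  # day 1 gets NaN
pd.Series([len(cur & prv) if isinstance(prv, set) else 0
           for cur, prv in zip(daily, prev)], index=daily.index)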
Solution for checking all previous days:
res = pd.DataFrame()
for day_num in df["Day"].unique():
    tmp = df[df["Day"] == day_num]
    tmp2 = pd.DataFrame(pd.concat([tmp["pid1"], tmp["pid2"]]).unique(), columns=["players"])
    tmp2["day"] = day_num
    res = pd.concat([res, tmp2])
res = res.reset_index(drop=True)
This combines all pid1 and pid2 values into a single players column:
res.head()
players day
0 1a 1
1 1c 1
2 1d 1
3 1e 1
4 1b 1
Then, for each day, count how many of its players appeared on any previous day:
result = []
for day_num in df["Day"].unique():
    current_players = pd.Series(res[res["day"] == day_num].players.unique())
    previous_players = pd.Series(res[res["day"] < day_num].players.unique())
    result.append(len(current_players[current_players.isin(previous_players)]))
result = pd.Series(result, index=df["Day"].unique())
result
The result in pd.Series format:
1 0
2 3
3 2
4 5
5 5
dtype: int64
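As a sanity check, a minimal set-based sketch (mine, not from the original answer) that computes the same "seen on any earlier day" counts in one pass, avoiding the repeated DataFrame filtering:
seen = set()
counts = {}
for day, grp in df.groupby("Day"):
    players = set(grp["pid1"]) | set(grp["pid2"])
    counts[day] = len(players & seen)  # players already seen on an earlier day
    seen |= players
pd.Series(counts)  # 1: 0, 2: 3, 3: 2, 4: 5, 5: 5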
Hope it works!

Related

Moving aggregate within a specified date range

Using the sample credit card transaction data below:
import random
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
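Since randint makes the frame above non-reproducible, here is a deterministic construction of the exact data shown (my addition, for anyone following along):
import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2020-06-07', '2020-06-11', '2020-06-14',
                            '2020-06-04', '2020-06-12']),
    'amount': [11, 45, 87, 48, 76]})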
I'm trying to compute, for each transaction, the total amount spent on that card in the 7 days before the transaction. For example, if card_id 1 made a transaction on June 8, I want the total of its transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       # rolling('7D') includes the current transaction; shifting
                       # within each card moves each total to the next transaction,
                       # so the current row is excluded from its own window
                       .groupby('card_id').shift(fill_value=0)
                       .values
                     )
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0

how to group a string in a column in python?

I have a dataframe:
PROD TYPE QUANTI
0 wood i2 20
1 tv ut1 30
2 tabl il3 50
3 rmt z1 40
4 zet u1 60
5 rm t1 60
6 rt t2 80
7 dud i4 40
I want to group the column "TYPE" into group categories (i, u, z, t, etc.).
Expected Output
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet u_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40
Use Series.replace to replace the trailing number with _group:
df['TYPE'] = df['TYPE'].replace(r'\d+', '_group', regex=True)
print (df)
PROD TYPE QUANTI
0 wood i_group 20
1 tv ut_group 30
2 tabl il_group 50
3 rmt z_group 40
4 zet u_group 60
5 rm t_group 60
6 rt t_group 80
7 dud i_group 40
If some values may contain no number, strip the digits first and then append the suffix:
df['TYPE'] = df['TYPE'].replace(r'\d+', '', regex=True) + '_group'
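An alternative sketch (my own, not from the original answer) that keeps only the leading letters and works whether or not a number is present:
# take the run of letters at the start of each value and append the suffix
df['TYPE'] = df['TYPE'].str.extract(r'^([A-Za-z]+)', expand=False) + '_group'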

Grouping data based on month-year in pandas and then dropping all entries except the latest one - Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs up to the year 2020.
I was only able to fetch counts by month-year; I could not put together code that groups the data by month-year and indicator and gives the correct results.
Use Series.dt.to_period for month periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
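An equivalent sketch (my own alternative, assuming the frame above): sort by date and keep the last row of each month period.
out = (df.sort_values('Date')
         .groupby(df['Date'].dt.to_period('M'))  # the grouper Series aligns on the index
         .tail(1)
         .sort_index())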

How to convert multi-indexed datetime index into integer?

I have a multi-indexed dataframe (a groupby result, grouped by 'id' and 'date').
x y
id date
abc 3/1/1994 100 7
9/1/1994 90 8
3/1/1995 80 9
bka 5/1/1993 50 8
7/1/1993 40 9
I'd like to convert those dates into integer-like labels, such as:
x y
id date
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
I thought it would be simple, but I couldn't get there easily. Is there a simple way to do this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
x y
id
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples(zip(df.index.get_level_values('id'), lvl1))
output:
x y
abc day 0 100 7
day 1 90 8
day 2 80 9
bka day 0 50 8
day 1 40 9
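A slightly simpler variant of the same idea (my suggestion, not from the original answer), using MultiIndex.from_arrays and keeping the level names:
lvl1 = 'day ' + df.groupby('id').cumcount().astype(str)
df.index = pd.MultiIndex.from_arrays(
    [df.index.get_level_values('id'), lvl1],
    names=['id', 'date'])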

Taking all duplicate values in column as single value in pandas

My current dataframe is:
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
I want to get a dataframe as:
Name term Grade
0 A 1 35
2 40
1 B 1 50
2 45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated to build a boolean mask, combined with numpy.where:
import numpy as np

mask = df['Name'].duplicated()
# more general:
# mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
The difference between the two masks can be seen with a modified DataFrame:
print (df)
Name term Grade
0 A 1 35
1 A 2 40
2 B 1 50
3 B 2 45
4 A 4 43
5 A 3 46
If the same name appears in multiple separate consecutive runs (like the two A groups here), the general solution keeps the first row of each run:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 A 4 43
5 3 46
With the simple duplicated mask, the second A run is blanked out as well:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
Name term Grade
0 A 1 35
1 2 40
2 B 1 50
3 2 45
4 4 43
5 3 46
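To see why the general mask works, a small sketch (my own) of the intermediate run IDs for the modified frame above: ne(shift) flags the start of each run, cumsum numbers the runs, and duplicated() then keeps only each run's first row.
runs = df['Name'].ne(df['Name'].shift()).cumsum()
# for Name = [A, A, B, B, A, A] this gives runs = [1, 1, 2, 2, 3, 3];
# runs.duplicated() is False at rows 0, 2 and 4 (the first row of each run)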
