Pandas : How to iterate over only back dates over two ID columns to give unique IDs and count? - python-3.x
I have a dataset spanning over a month. I need to count the players who have played on a previous day. So when looking at the users of 5th June, I need to find how many of them also appeared on any date preceding 5th June, even once.
The dataset is something like this:
Day pid1 pid2
1 1a 1b
1 1c 2e
1 1d 2w
1 1e 2q
2 1f 4r
2 1g 5t
2 2e 7u
2 2w 8i
2 2q 9o
3 4r 0yu
3 5t 5t
3 6t 1w
4 1a 2e
4 1f 9o
4 7u 6h
5 8i 4f
5 9o 3d
5 0yu 5g
5 5t 6h
I have tried iterating over days and then over pid1 and pid2, but to no avail, and it is computationally expensive as I have over 5 million data points.
I really do not know how to approach this; the only thing I have tried is this:
for x in range(1, 31):
    for i in ids.iterrows():
        if i['Ids'] == zip(df4['pid1'], df['pid2']):
            print(x, i.count())
But it still doesn't let me iterate over only previous days and not next days.
I need an answer that looks something like this (the numbers are not accurate), i.e. for each day, the unique count of that day's users who also appeared on previous days:
Day Previous day users
1 0
2 2
3 2
4 5
5 5
If I understand correctly, you want to count the number of player IDs that appeared on any day before the given Day. You can try the below:
m = (df.melt('Day').sort_values('Day').drop_duplicates(['Day','value'])
       .reset_index(drop=True).drop(columns='variable'))
m.assign(k=m.groupby('value').cumcount()).groupby('Day')['k'].sum() #assign it back
Day
1 0
2 3
3 2
4 6
5 7
If cumulative counts are not required, and instead each player should count at most once per day (i.e. only whether they appeared on any earlier day), use:
m.assign(k=m.groupby('value').cumcount().ne(0)).groupby('Day')['k'].sum() #.astype(int)
Day
1 0
2 3
3 2
4 5
5 5
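The same "appeared on any earlier day" counts can be cross-checked with a plain set-based sketch on the sample data (column names assumed as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Day":  [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
    "pid1": ["1a", "1c", "1d", "1e", "1f", "1g", "2e", "2w", "2q",
             "4r", "5t", "6t", "1a", "1f", "7u", "8i", "9o", "0yu", "5t"],
    "pid2": ["1b", "2e", "2w", "2q", "4r", "5t", "7u", "8i", "9o",
             "0yu", "5t", "1w", "2e", "9o", "6h", "4f", "3d", "5g", "6h"],
})

seen = set()     # every player id observed on any earlier day
counts = {}
for day, g in df.groupby("Day"):
    players = set(g["pid1"]) | set(g["pid2"])  # unique players of this day
    counts[day] = len(players & seen)          # how many appeared before
    seen |= players

# counts -> {1: 0, 2: 3, 3: 2, 4: 5, 5: 5}, matching the output above
```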
Edit: After the OP's comment, I am providing both answers:
Solution for checking only one day before:
Instead of two for loops and an if statement, I used pandas operations to improve computational speed.
df.head()
Day pid1 pid2
0 1 1a 1b
1 1 1c 2e
2 1 1d 2w
3 1 1e 2q
4 2 1f 4r
Then group by Day to collect the players into lists:
tmp = df.groupby("Day").agg(list)
tmp
Day pid1 pid2
1 [1a, 1c, 1d, 1e] [1b, 2e, 2w, 2q]
2 [1f, 1g, 2e, 2w, 2q] [4r, 5t, 7u, 8i, 9o]
3 [4r, 5t, 6t] [0yu, 5t, 1w]
4 [1a, 1f, 7u] [2e, 9o, 6h]
5 [8i, 9o, 0yu, 5t] [4f, 3d, 5g, 6h]
Then concatenate the ith day's players with the (i-1)th day's players:
tmp2 = pd.DataFrame(tmp["pid1"] + tmp["pid2"], columns = ["current_day"])
tmp2["previous_day"] = tmp2["current_day"].shift()
tmp2 = tmp2.fillna("nan")
tmp2
Day current_day previous_day
1 [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q] nan
2 [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o] [1a, 1c, 1d, 1e, 1b, 2e, 2w, 2q]
3 [4r, 5t, 6t, 0yu, 5t, 1w] [1f, 1g, 2e, 2w, 2q, 4r, 5t, 7u, 8i, 9o]
4 [1a, 1f, 7u, 2e, 9o, 6h] [4r, 5t, 6t, 0yu, 5t, 1w]
5 [8i, 9o, 0yu, 5t, 4f, 3d, 5g, 6h] [1a, 1f, 7u, 2e, 9o, 6h]
And finally, find the length of the intersection of the two lists, i.e. the number of players who played on both the current day and the previous day:
tmp2.apply(lambda x: len(list(set(x["current_day"]) & set(x["previous_day"]))), axis = 1)
Day
1 0
2 3
3 2
4 0
5 2
dtype: int64
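The steps above can be condensed into one self-contained, set-based sketch (same sample data as in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "Day":  [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5],
    "pid1": ["1a", "1c", "1d", "1e", "1f", "1g", "2e", "2w", "2q",
             "4r", "5t", "6t", "1a", "1f", "7u", "8i", "9o", "0yu", "5t"],
    "pid2": ["1b", "2e", "2w", "2q", "4r", "5t", "7u", "8i", "9o",
             "0yu", "5t", "1w", "2e", "9o", "6h", "4f", "3d", "5g", "6h"],
})

# one set of players per day
per_day = df.groupby("Day").apply(lambda g: set(g["pid1"]) | set(g["pid2"]))

# intersect each day's set with the previous day's set
prev = per_day.shift()
overlap = pd.Series(
    [len(cur & p) if isinstance(p, set) else 0 for cur, p in zip(per_day, prev)],
    index=per_day.index,
)
# overlap -> 0, 3, 2, 0, 2 for days 1..5, as above
```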
Solution for checking all previous days:
res = pd.DataFrame()
for day_num in df["Day"].unique():
    tmp = df[df["Day"] == day_num]
    tmp2 = pd.DataFrame(pd.concat([tmp["pid1"], tmp["pid2"]]).unique(), columns = ["players"])
    tmp2["day"] = day_num
    res = pd.concat([res, tmp2])
res = res.reset_index(drop = True)
This combines all pid1 and pid2 values into a single players column:
res.head()
players day
0 1a 1
1 1c 1
2 1d 1
3 1e 1
4 1b 1
Then calculate, for each current day, how many of its players appeared on any previous day:
result = []
for day_num in df["Day"].unique():
    current_players = pd.Series(res[res["day"] == day_num].players.unique())
    previous_players = pd.Series(res[res["day"] < day_num].players.unique())
    result.append(len(current_players[current_players.isin(previous_players)]))
result = pd.Series(result, index = df["Day"].unique())
result
result
The result in pd.Series format:
1 0
2 3
3 2
4 5
5 5
dtype: int64
Hope it works!
Related
Moving aggregate within a specified date range
Using the sample credit card transactions data below:
df = pd.DataFrame({
    'card_id' : [1, 1, 1, 2, 2],
    'date' : [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount' : [random.randint(1, 100) for i in range(5)]})
df
   card_id       date  amount
0        1 2020-06-07      11
1        1 2020-06-11      45
2        1 2020-06-14      87
3        2 2020-06-04      48
4        2 2020-06-12      76
I'm trying to take the total amount spent in the past 7 days of a card at the point of the transaction. For example, if card_id 1 made a transaction on June 8, I want to get the total transactions from June 1 to June 7. This is what I was hoping to get:
   card_id       date  amount  sum_past_7d
0        1 2020-06-07      11            0
1        1 2020-06-11      45           11
2        1 2020-06-14      87           56
3        2 2020-06-04      48            0
4        2 2020-06-12      76           48
I'm currently using this function with pd.apply to generate my desired column, but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))
def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values
                    )
Output:
   card_id       date  amount  sum_past_7D
0        1 2020-06-07      11          0.0
1        1 2020-06-11      45         11.0
2        1 2020-06-14      87         56.0
3        2 2020-06-04      48          0.0
4        2 2020-06-12      76         48.0
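The same approach can be run end to end on the sample, with the dates fixed rather than random so the output is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2020-06-07", "2020-06-11", "2020-06-14",
                            "2020-06-04", "2020-06-12"]),
    "amount": [11, 45, 87, 48, 76],
})

df = df.sort_values(["card_id", "date"])
# 7-day rolling sum including the current row, then shifted one row back
# within each card so that only *past* transactions are counted
df["sum_past_7D"] = (df.set_index("date").groupby("card_id")
                       ["amount"].rolling("7D").sum()
                       .groupby("card_id").shift(fill_value=0)
                       .values)
# sum_past_7D -> 0, 11, 56, 0, 48
```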
how to group a string in a column in python?
I have a dataframe:
   PROD TYPE  QUANTI
0  wood   i2      20
1    tv  ut1      30
2  tabl  il3      50
3   rmt   z1      40
4   zet   u1      60
5    rm   t1      60
6    rt   t2      80
7   dud   i4      40
I want to group the column "TYPE" into group categories of (i, u, z, y ... etc.)
Expected output:
   PROD      TYPE  QUANTI
0  wood   i_group      20
1    tv  ut_group      30
2  tabl  il_group      50
3   rmt   z_group      40
4   zet   y_group      60
5    rm   t_group      60
6    rt   t_group      80
7   dud   i_group      40
Use Series.replace to replace the numbers with _group:
df['TYPE'] = df['TYPE'].replace(r'\d+', '_group', regex=True)
print (df)
   PROD      TYPE  QUANTI
0  wood   i_group      20
1    tv  ut_group      30
2  tabl  il_group      50
3   rmt   z_group      40
4   zet   u_group      60
5    rm   t_group      60
6    rt   t_group      80
7   dud   i_group      40
If some values may contain no number, use:
df['TYPE'] = df['TYPE'].replace(r'\d+', '', regex=True) + '_group'
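As a quick end-to-end check of the replace call (a minimal sketch with a cut-down frame):

```python
import pandas as pd

df = pd.DataFrame({"PROD": ["wood", "tv", "rmt"],
                   "TYPE": ["i2", "ut1", "z1"]})

# replace the run of digits with the _group suffix in one regex pass
df["TYPE"] = df["TYPE"].replace(r"\d+", "_group", regex=True)
# df["TYPE"] -> ["i_group", "ut_group", "z_group"]
```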
Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python
Below is my example dataframe:
          Date Indicator  Value
0   2000-01-30         A     30
1   2000-01-31         A     40
2   2000-03-30         C     50
3   2000-02-27         B     60
4   2000-02-28         B     70
5   2000-03-31         C     90
6   2000-03-28         C    100
7   2001-01-30         A     30
8   2001-01-31         A     40
9   2001-03-30         C     50
10  2001-02-27         B     60
11  2001-02-28         B     70
12  2001-03-31         C     90
13  2001-03-28         C    100
Desired output:
          Date Indicator  Value
    2000-01-31         A     40
    2000-02-28         B     70
    2000-03-31         C     90
    2001-01-31         A     40
    2001-02-28         B     70
    2001-03-31         C     90
I want to write code that groups the data by a particular month-year, keeps the entry with the latest date in that month-year, and drops the rest. The data runs until the year 2020. So far I have only been able to fetch the count by month-year; I have not been able to write code that groups the data by month-year and indicator and gives the correct results.
Use Series.dt.to_period for month periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0     2000-01
1     2000-01
2     2000-03
3     2000-02
4     2000-02
5     2000-03
6     2000-03
7     2001-01
8     2001-01
9     2001-03
10    2001-02
11    2001-02
12    2001-03
13    2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
          Date Indicator  Value
1   2000-01-31         A     40
4   2000-02-28         B     70
5   2000-03-31         C     90
8   2001-01-31         A     40
11  2001-02-28         B     70
12  2001-03-31         C     90
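An equivalent route, if you prefer it, is to sort by date and keep the last row of each monthly group (a sketch on the first year of the sample):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2000-01-30", "2000-01-31", "2000-03-30",
                            "2000-02-27", "2000-02-28", "2000-03-31",
                            "2000-03-28"]),
    "Indicator": ["A", "A", "C", "B", "B", "C", "C"],
    "Value": [30, 40, 50, 60, 70, 90, 100],
})

# after sorting by date, the last row of each month period is the latest entry
df_sorted = df.sort_values("Date")
out = (df_sorted.groupby(df_sorted["Date"].dt.to_period("M"))
                .tail(1)
                .sort_values("Date"))
# out["Value"] -> [40, 70, 90]
```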
How to convert multi-indexed datetime index into integer?
I have a multi-indexed dataframe (a groupby object) as the result of a groupby (by 'id' and 'date'):
                 x  y
id  date
abc 3/1/1994  100  7
    9/1/1994   90  8
    3/1/1995   80  9
bka 5/1/1993   50  8
    7/1/1993   40  9
I'd like to convert those dates into something integer-like, such as:
              x  y
id  date
abc day 0   100  7
    day 1    90  8
    day 2    80  9
bka day 0    50  8
    day 1    40  9
I thought it would be simple but I couldn't get there easily. Is there a simple way to do this?
Try this:
s = 'day ' + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel(1)
             x  y
id
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
You can calculate the new level and create a new index:
lvl1 = 'day ' + df.groupby('id').cumcount().astype('str')
df.index = pd.MultiIndex.from_tuples(
    (x, y) for x, y in zip(df.index.get_level_values('id'), lvl1)
)
output:
             x  y
abc day 0  100  7
    day 1   90  8
    day 2   80  9
bka day 0   50  8
    day 1   40  9
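To try either answer in isolation, here is a minimal reconstruction of the grouped frame (index values assumed from the question), using the cumcount-based relabelling:

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [100, 90, 80, 50, 40], "y": [7, 8, 9, 8, 9]},
    index=pd.MultiIndex.from_tuples(
        [("abc", "3/1/1994"), ("abc", "9/1/1994"), ("abc", "3/1/1995"),
         ("bka", "5/1/1993"), ("bka", "7/1/1993")],
        names=["id", "date"],
    ),
)

# number the rows within each id and use that as the new second level
s = "day " + df.groupby(level=0).cumcount().astype(str)
df1 = df.set_index([s], append=True).droplevel("date")
# df1.loc[("abc", "day 2"), "x"] -> 80
```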
Taking all duplicate values in column as single value in pandas
My current dataframe is:
  Name  term  Grade
0    A     1     35
1    A     2     40
2    B     1     50
3    B     2     45
I want to get a dataframe like:
  Name  term  Grade
0    A     1     35
           2     40
1    B     1     50
           2     45
Is it possible to get my expected output? If yes, how can I do it?
Use duplicated for a boolean mask with numpy.where:
mask = df['Name'].duplicated()
#more general
#mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
The difference between the masks can be seen on a changed DataFrame:
print (df)
  Name  term  Grade
0    A     1     35
1    A     2     40
2    B     1     50
3    B     2     45
4    A     4     43
5    A     3     46
If there are multiple identical consecutive groups, like the two A groups here, the general solution is needed:
mask = df['Name'].ne(df['Name'].shift()).cumsum().duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4    A     4     43
5          3     46
whereas the simple mask blanks every repeat, even across groups:
mask = df['Name'].duplicated()
df['Name'] = np.where(mask, '', df['Name'])
print (df)
  Name  term  Grade
0    A     1     35
1          2     40
2    B     1     50
3          2     45
4          4     43
5          3     46
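A self-contained run of the simple duplicated variant, with the sample frame assumed from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["A", "A", "B", "B"],
                   "term": [1, 2, 1, 2],
                   "Grade": [35, 40, 50, 45]})

# blank out every Name that already appeared above it
df["Name"] = np.where(df["Name"].duplicated(), "", df["Name"])
# df["Name"] -> ["A", "", "B", ""]
```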