How to get the top values within each group? - python-3.x

I am new to Pandas and I have a dataset that looks something like this.
s_name Time p_name qty
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0
I am trying to group by s_name and find the sum of qty for each unique p_name in a month, but only display the p_name values with the top 2 highest quantities. Below is an example of how I want the final output to look.
s_name Time p_name qty
A 01 DEF 4
A 01 ABC 2
B 02 ABC 3
B 02 DEF 2
B 03 ABC 2
B 03 FGH 0
Do you have any ideas? I have been stuck here for quite long so much help is appreciated.

Create a month column using dt, then group by s_name and month and apply a function to the groups: group each group by p_name, sum qty, sort_values descending, and keep only the first two rows with head:
df.Time = pd.to_datetime(df.Time, format='%d/%m/%Y')
df['month'] = df.Time.dt.month
df_f = df.groupby(['s_name', 'month']).apply(
    lambda df:
        df.groupby('p_name').qty.sum()
          .sort_values(ascending=False).head(2)
).reset_index()
df_f
# s_name month p_name qty
# 0 A 1 DEF 4
# 1 A 1 ABC 2
# 2 B 2 ABC 3
# 3 B 2 DEF 2
# 4 B 3 ABC 3
# 5 B 3 FGH 0
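As a side note, the same result can usually be reached without apply by aggregating first and then taking the top rows per group. A minimal sketch, assuming the same df as above with the parsed Time and month columns already added:

agg = (df.groupby(['s_name', 'month', 'p_name'], as_index=False).qty.sum()
         .sort_values('qty', ascending=False))
# Within each (s_name, month) group the rows are already sorted by qty,
# so head(2) keeps the two largest sums per group.
top2 = agg.groupby(['s_name', 'month']).head(2).sort_values(['s_name', 'month'])

This avoids the per-group Python function call, which tends to matter on larger frames.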

I am new to Pandas myself. I am going to attempt to answer your question.
See this code.
from io import StringIO
import pandas as pd

columns = "s_name Time p_name qty"
# Create the dataframe from text.
df = pd.read_csv(
    StringIO(
        f"""{columns}
A 12/01/2019 ABC 1
A 12/01/2019 ABC 1
A 12/01/2019 DEF 2
A 12/01/2019 DEF 2
A 12/01/2019 FGH 0
B 13/02/2019 ABC 3
B 13/02/2019 DEF 1
B 13/02/2019 DEF 1
B 13/03/2019 ABC 3
B 13/03/2019 FGH 0"""
    ),
    sep=" ",
)
S_NAME, TIME, P_NAME, QTY = columns.split()
MONTH = "month"
# Convert the Time column to datetime types.
df.Time = pd.to_datetime(df.Time, dayfirst=True)
# Create a month column with zero-filled strings.
df[MONTH] = df.Time.apply(lambda x: str(x.month).zfill(2))
# Group and aggregate.
group = df.groupby(by=[S_NAME, P_NAME, MONTH])
gdf = (
    group.sum()
    .sort_index()
    .sort_values(by=[S_NAME, MONTH, QTY], ascending=False)
    .reset_index()
)
gdf.groupby([S_NAME, MONTH]).head(2).sort_values(by=[S_NAME, MONTH]).reset_index()
Is this the result you expected?

Related

Drop whole groups by multiple columns if a specific value does not exist in another column in Pandas

How can I drop the whole group (by city and district) if the date value 2018/11/1 does not exist in that group in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the dates, and test whether at least one value per group is True with GroupBy.any via GroupBy.transform; then filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If you get an error from missing values in the mask, one possible idea is to replace missing values in the columns used for grouping:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
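If date is still stored as strings, it would need to be parsed first for the dt.date comparison above to work; a small sketch, assuming the YYYY/M/D strings shown in the question:

# pd.to_datetime handles the '2018/9/1' style strings from the sample data.
df['date'] = pd.to_datetime(df['date'])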

Convert date to week and count the dependencies from different columns of a dataframe

I have a dataframe like this:
date Company Email
2019-10-07 abc mr1#abc.com
2019-10-07 def mr1#def.com
2019-10-07 abc mr1#abc.com
2019-10-08 xyz mr1#xyz.com
2019-10-08 abc mr2#abc.com
2019-10-15 xyz mr2#xyz.com
2019-10-15 def mr1#def.com
2019-10-17 xyz mr1#xyz.com
2019-10-17 abc mr2#abc.com
I have to create 2 dataframes like this:
dataframe 1:
Weeks abc def xyz
october7-october14 3 1 1
october15-october22 1 1 2
and dataframe2: Unique count for Emails as well weekwise
Weeks Company Email_ID count
october7-october14 abc mr1#abc.com 2
mr2#abc.com 1
def mr1#def.com 1
xyz mr1#xyz.com 1
october15-october22 abc mr2#abc.com 1
def mr1#def.com 1
xyz mr1#xyz.com 1
mr2#xyz.com 1
Below is the code I tried to create dataframe1:
df1['Date'] = pd.to_datetime(df1['date']) - pd.to_timedelta(7, unit='d')
df1 = df1.groupby(['Company', pd.Grouper(key='Date', freq='W-MON')])['Email_ID'].count().sum().reset_index().sort_values('Date')
Company Date Email_ID
abc 2019-10-07 mr1#abc.com.mr1#abc.com.mr2#abc.com
def 2019-10-07 mr1#def.com
xyz 2019-10-07 mr1#xyz.com
abc 2019-10-15 mr2#abc.com
def 2019-10-15 mr1#def.com
xyz 2019-10-15 mr1#xyz.com.mr2#xyz.com
Here the sum is concatenating the Email_ID strings instead of producing numerical counts, so I am not able to represent my data as I want in dataframe1 and dataframe2.
Please provide insights on how I can represent my data as dataframe1 and dataframe2.
Grouper needs datetimes, so the format of the datetimes is changed with MultiIndex.set_levels after the aggregation; closed='left' is also added for left-closed bins:
df1['date'] = pd.to_datetime(df1['date'])
df2 = df1.groupby([pd.Grouper(key='date', freq='W-MON', closed='left'),
                   'Company',
                   'Email'])['Email'].count()

new = ((df2.index.levels[0] - pd.to_timedelta(7, unit='d')).strftime('%B%d') + ' - ' +
       df2.index.levels[0].strftime('%B%d'))
df2.index = df2.index.set_levels(new, level=0)
print (df2)
date Company Email
October07 - October14 abc mr1#abc.com 2
mr2#abc.com 1
def mr1#def.com 1
xyz mr1#xyz.com 1
October14 - October21 abc mr2#abc.com 1
def mr1#def.com 1
xyz mr1#xyz.com 1
mr2#xyz.com 1
Name: Email, dtype: int64
For the first DataFrame, sum over the first and second index levels and reshape with Series.unstack:
df3 = df2.sum(level=[0,1]).unstack(fill_value=0)
print (df3)
Company abc def xyz
date
October07 - October14 3 1 1
October14 - October21 1 1 2
Add a column that maps each date to a week.
Then do something with grouping, e.g. df.groupby(df.week).count(). A sketch follows.
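To make that concrete, here is a minimal sketch of both requested dataframes, assuming the column names date, Company and Email from the question (the exact week labels are illustrative):

import pandas as pd

df['date'] = pd.to_datetime(df['date'])
# Map every date to the Monday of its week and build a readable week label.
week_start = df['date'] - pd.to_timedelta(df['date'].dt.weekday, unit='d')
df['week'] = (week_start.dt.strftime('%B%d') + ' - ' +
              (week_start + pd.Timedelta(days=7)).dt.strftime('%B%d'))

# dataframe1: one row per week, one column per Company, counting occurrences.
df1 = df.groupby(['week', 'Company']).size().unstack(fill_value=0)

# dataframe2: per-week, per-company counts of each unique Email.
df2 = df.groupby(['week', 'Company', 'Email']).size().rename('count').reset_index()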

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that roughly follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries have the " _ ", others have " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
#Replace the '_' for ' '
new_df['Name'] = df['Name'].str.replace('_',' ')
My problem is that my new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only the Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it to the same DataFrame, e.g. df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI

Count the number of dates falling between two dates

I have a data set like this:
ID date value_1 value_2 tech start_date last_date
ab 2017-06-01 3476.44 324 A 2015-05-04 2018-06-01
ab 2017-07-01 3556.65 332 A 2016-06-07 2018-07-01
ab 2017-08-01 3552.65 120 B 2016-01-08 2018-01-01
ab 2017-09-01 3201.66 987 C 2015-04-08 2018-04-01
bc 2017-10-01 3059.02 652 C 2015-06-09 2018-03-01
bc 2017-11-01 2853.37 345 C 2018-01-01 2018-08-01
bc 2017-12-01 2871.29 554 C 2015-10-01 2018-01-01
I want to keep the ID and the tech fixed and count how many of the dates fall between start_date and last_date.
Like:
ID count
ab 4
ab 4
ab 4
ab 4
bc 2
bc 2
bc 2
I built a function to do the count and then applied it with a groupby:
def count_c(data):
    d = {}
    d['count'] = np.sum(
        [x > data['start_date '] & x < data['last_date '] for x in data['date']])
    return pd.Series(d, index=['count'])
df_model1 = flag.groupby('date').apply(count_c)
Quite simple actually: instead of using a function, use the datetime library and subtract each date.
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame(columns=['ID', 'date', 'value_1', 'value_2', 'tech', 'start_date', 'last_date']) # Your DataFrame
days_list = []
EDIT: The solution now counts the number of rows falling between the start_date and last_date columns.
for i, row in df.iterrows():
    s_date = datetime.strptime(row['start_date'], '%m/%d/%y')
    e_date = datetime.strptime(row['last_date'], '%m/%d/%y')
    days = abs((e_date - s_date).days)
    days_list.append(days)

days_list = np.array(days_list)
df['Days'] = days_list
def dates(df):
    """
    :param df: DataFrame
    :param start_date: (str) mm/dd/yy
    :param end_date: (str) mm/dd/yy
    :return: number of rows
    """
    n = 0
    for _, ro in df.iterrows():
        y = datetime.strptime(ro['start_date'], '%m/%d/%y')
        t = datetime.strptime(ro['last_date'], '%m/%d/%y')
        d = datetime.strptime(ro['date'], '%m/%d/%y')
        if y < d < t:
            n += 1
    return n

print(dates(df))
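For reference, a vectorized sketch that reproduces the expected counts per ID without a Python loop, assuming the three date columns are first parsed with pd.to_datetime:

for col in ['date', 'start_date', 'last_date']:
    df[col] = pd.to_datetime(df[col])

# True where date lies between start_date and last_date (inclusive).
in_range = df['date'].between(df['start_date'], df['last_date'])
# Count the in-range rows per ID and broadcast that count back to every row.
df['count'] = in_range.groupby(df['ID']).transform('sum')

On the sample data this gives 4 for every ab row and 2 for every bc row, matching the expected output.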

New column within a Pandas DataFrame with respect to duplicates in a given column

Hi, I have a dataframe with a column "id" like below:
id
abc
def
ghi
abc
abc
xyz
def
I need a new column "id1" with the number 1 appended to the id, and the number should be incremented for every duplicate. The output should look like below.
id id1
abc abc1
def def1
ghi ghi1
abc abc2
abc abc3
xyz xyz1
def def2
Can anyone suggest a solution for this?
Use GroupBy.cumcount to count the ids, add 1 and convert to strings:
df['id1'] = df['id'] + df.groupby('id').cumcount().add(1).astype(str)
print (df)
id id1
0 abc abc1
1 def def1
2 ghi ghi1
3 abc abc2
4 abc abc3
5 xyz xyz1
6 def def2
Detail:
print (df.groupby('id').cumcount())
0 0
1 0
2 0
3 1
4 2
5 0
6 1
dtype: int64
