Given a DataFrame like this:
A
B
C
D
0
ABC
unique_ident_1
10
ONE
1
KLM
unique_ident_2
2
TEN
2
KLM
unique_ident_2
7
TEN
3
XYZ
unique_ident_3
2
TWO
3
ABC
unique_ident_1
8
ONE
3
XYZ
unique_ident_3
-5
TWO
where column "B" contains a unique text identifier, columns "A" and "D" contain some constant texts dependent from unique id, and column C has a quantity. I want to group rows by unique identifiers (col "B") with quantity column summed up by ident:
A
B
C
D
0
ABC
unique_ident_1
18
ONE
1
KLM
unique_ident_2
9
TEN
2
XYZ
unique_ident_3
-3
TWO
How can I get this result with pandas?
use named tuples with a groupby.
df1 = df.groupby('B',as_index=False).agg(
A=('A','first'),
C=('C','sum'),
D=('D','first')
)[df.columns]
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
You can also create a dictionary and then group incase you have many columns:
agg_d = {col:'sum' if col=='C' else'first' for col in df.columns}
out = df.groupby('B').agg(agg_d).reset_index(drop=True)
print(out)
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
I have two data frames with structure as given below.
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
>>> df2
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
I want them to combine such that new data frame is a copy of df1 with the TEXT field appearing in df2 for the corresponding IID is appended to the TEXT field of df1 with duplicates removed (cases insensitive duplication check).
My expected output is
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
I tried with groupby on df2, but how can I do the joint of the groupie object to a dataframe ?
I believe you need concat with groupby.agg to create the skeleton with duplicates , then series.explode with groupby+unique for de-duplicating
out = (pd.concat((df1,df2),sort=False).groupby('IID')
.agg({'NAME':'first','TEXT':','.join}).reset_index())
out['TEXT'] = (out['TEXT'].str.upper().str.split(',').explode()
.groupby(level=0).unique().str.join(','))
print(out)
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
I took the reverse steps. First combined the rows having the same values to a list then merge and then combine the two columns into a single column.
df1:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
df2:
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
df3 = pd.DataFrame(df2.groupby("IID")['TEXT'].apply(list).transform(lambda x: ','.join(x).upper()).reset_index())
df3:
IID TEXT
0 10 AA,AB
1 11 ABC,A,C,AB
2 12 AA
3 13 AC,AD,ABC
df4 = pd.merge(df1,df3,on='IID')
df4:
IID NAME TEXT_x TEXT_y
0 10 One AA,AB AA,AB
1 11 Two AB,AC ABC,A,C,AB
2 12 Three AB AA
3 13 Four AC AC,AD,ABC
df4['TEXT'] = df4[['TEXT_x','TEXT_y']].apply(
lambda x: ','.join(pd.unique(','.join(x).split(','))),
axis=1
)
df4.drop(['TEXT_x','TEXT_y'],axis=1)
OR
df5 = df1.assign(TEXT = df4.apply(
lambda x: ','.join(pd.unique(','.join(x[['TEXT_x','TEXT_y']]).split(','))),
axis=1))
df4/df5:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
I have two dataframes df1 and df2. Here is a small sample
Days
4
6
9
1
4
My df2 is
Day1 Day2 Alphabets
2 5 abc
4 7 bcd
8 10 ghi
10 12 abc
I want to change my df1 such that it has new column Alphabets from df2 if the days in df1 is between day1 and day2. Something like:
if df1['Days'] in between df2['Day1'] and df2['Day2']:
df1['Alphabets']=df2['Alphabets']
Result is:
Days Alphabets
4 abc
6 bcd
9 ghi
etc.
I tried for loop and its taking a lot of time even to run. Is there any other elegant way to do?
Thanks in advance
I will use numpy broadcast
s1=df2.Day1.values
s2=df2.Day2.values
s=df1.Days.values[:,None]
df1['V']=((s-s1>0)&(s-s2<0)).dot(df2.Alphabets)
df1
Out[277]:
Days V
0 4 abc
1 6 bcd
2 9 ghi
3 1
4 4 abc
I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to group by the userIds if X is greater than 3 for 3 weeks.
I have tried using groupby and lambda in pandas but I am stuck
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Get counts of values greater like 3 with aggregate sum and then filter values greater like 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
I am new to pandas. I have dataframe,df with 3 columns:(date),(name) and (count).
Given each day: is there an easy way to create a new dataframe from original one that contains new columns representing the unique names in the original (name column) and their respective count values in the correct columns?
date name count
0 2017-08-07 ABC 12
1 2017-08-08 ABC 5
2 2017-08-08 TTT 6
3 2017-08-09 TAC 5
4 2017-08-09 ABC 10
It should now be
date ABC TTT TAC
0 2017-08-07 12 0 0
1 2017-08-08 5 6 0
3 2017-08-09 10 0 5
df = pd.DataFrame({"date":["2017-08-07","2017-08-08","2017-08-08","2017-08-09","2017-08-09"],"name":["ABC","ABC","TTT","TAC","ABC"], "count":["12","5","6","5","10"]})
df = df.pivot(index='date', columns='name', values='count').reset_index().fillna(0)