Pandas: Sort a dataframe based on multiple columns

I know that this question has been asked several times. But none of the answers match my case.
I have a pandas dataframe with two columns, Department and Employee_Count. I need to sort by Employee_Count in descending order, but if there is a tie between two counts, the tied rows should be sorted alphabetically by Department.
Department Employee_Count
0 abc 10
1 adc 10
2 bca 11
3 cde 9
4 xyz 15
Required output:
Department Employee_Count
0 xyz 15
1 bca 11
2 abc 10
3 adc 10
4 cde 9
This is what I've tried.
df = df.sort_values(['Department','Employee_Count'],ascending=[True,False])
But this just sorts the departments alphabetically.
I've also tried to sort by Department first and then by Employee_Count. Like this:
df = df.sort_values(['Department'],ascending=[True])
df = df.sort_values(['Employee_Count'],ascending=[False])
This doesn't give me correct output either:
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10
0 abc 10
3 cde 9
It gives 'adc' first and then 'abc'.
Kindly help me.

You can swap the column names in the list, and likewise the values in the ascending parameter:
Explanation:
The order of the column names is the order of sorting: first sort descending by Employee_Count, and wherever Employee_Count has duplicate values, only those tied rows are sorted ascending by Department.
df1 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, True])
print (df1)
Department Employee_Count
4 xyz 15
2 bca 11
0 abc 10 <-
1 adc 10 <-
3 cde 9
Or, to test: if the second value is also False, the duplicated rows are sorted descending instead:
df2 = df.sort_values(['Employee_Count', 'Department'], ascending=[False, False])
print (df2)
Department Employee_Count
4 xyz 15
2 bca 11
1 adc 10 <-
0 abc 10 <-
3 cde 9
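As an aside, the two-pass idea from the question can also work, but only if the second sort is stable so that ties keep the order established by the first sort; sort_values defaults to quicksort, which is not stable. A minimal sketch of that variant (names taken from the example above):
import pandas as pd

df = pd.DataFrame({'Department': ['abc', 'adc', 'bca', 'cde', 'xyz'],
                   'Employee_Count': [10, 10, 11, 9, 15]})

# Sort by the tie-breaker first, then by the primary key with a stable
# algorithm, so rows with equal counts keep their alphabetical order.
df3 = (df.sort_values('Department')
         .sort_values('Employee_Count', ascending=False, kind='mergesort'))
print(df3)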

Related

How to group rows in pandas with sum in the certain column

Given a DataFrame like this:
     A               B   C    D
0  ABC  unique_ident_1  10  ONE
1  KLM  unique_ident_2   2  TEN
2  KLM  unique_ident_2   7  TEN
3  XYZ  unique_ident_3   2  TWO
4  ABC  unique_ident_1   8  ONE
5  XYZ  unique_ident_3  -5  TWO
where column "B" contains a unique text identifier, columns "A" and "D" contain some constant texts dependent from unique id, and column C has a quantity. I want to group rows by unique identifiers (col "B") with quantity column summed up by ident:
     A               B   C    D
0  ABC  unique_ident_1  18  ONE
1  KLM  unique_ident_2   9  TEN
2  XYZ  unique_ident_3  -3  TWO
How can I get this result with pandas?
Use named aggregation (keyword arguments with (column, function) tuples) with a groupby:
df1 = df.groupby('B', as_index=False).agg(
    A=('A', 'first'),
    C=('C', 'sum'),
    D=('D', 'first')
)[df.columns]
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
You can also create a dictionary and then group, in case you have many columns:
agg_d = {col: 'sum' if col == 'C' else 'first' for col in df.columns}
out = df.groupby('B').agg(agg_d).reset_index(drop=True)
print(out)
A B C D
0 ABC unique_ident_1 18 ONE
1 KLM unique_ident_2 9 TEN
2 XYZ unique_ident_3 -3 TWO
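For reference, a self-contained sketch of the named-aggregation answer, with the example frame rebuilt from the table above:
import pandas as pd

df = pd.DataFrame({'A': ['ABC', 'KLM', 'KLM', 'XYZ', 'ABC', 'XYZ'],
                   'B': ['unique_ident_1', 'unique_ident_2', 'unique_ident_2',
                         'unique_ident_3', 'unique_ident_1', 'unique_ident_3'],
                   'C': [10, 2, 7, 2, 8, -5],
                   'D': ['ONE', 'TEN', 'TEN', 'TWO', 'ONE', 'TWO']})

# One keyword per output column: keep the first A/D per identifier, sum C.
df1 = df.groupby('B', as_index=False).agg(A=('A', 'first'),
                                          C=('C', 'sum'),
                                          D=('D', 'first'))[df.columns]
print(df1)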

pandas combine a data frame with another groupby dataframe

I have two data frames with structure as given below.
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
>>> df2
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
I want to combine them such that the new data frame is a copy of df1 in which the TEXT values appearing in df2 for the corresponding IID are appended to the TEXT field of df1, with duplicates removed (case-insensitive duplicate check).
My expected output is
>>> df1
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
I tried with groupby on df2, but how can I join the grouped object back to a dataframe?
I believe you need concat with groupby.agg to create the skeleton with duplicates, then Series.explode with groupby + unique for de-duplicating:
out = (pd.concat((df1,df2),sort=False).groupby('IID')
.agg({'NAME':'first','TEXT':','.join}).reset_index())
out['TEXT'] = (out['TEXT'].str.upper().str.split(',').explode()
.groupby(level=0).unique().str.join(','))
print(out)
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC
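A note on why the de-duplication statement works (a tiny sketch on made-up strings, not the question's data): explode repeats the original row label for every token, so groupby(level=0) regroups the pieces per row, and unique keeps first-seen order rather than producing a sorted set:
import pandas as pd

s = pd.Series(['AA,AB,AA', 'AB,AC,ABC,AC'])
out = (s.str.split(',').explode()     # one row per token, original index repeated
        .groupby(level=0).unique()    # de-duplicate per original row, keeping order
        .str.join(','))
print(out)
# 0        AA,AB
# 1    AB,AC,ABC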
I took the reverse steps: first combine the rows having the same IID into a list, then merge, and finally combine the two TEXT columns into a single column.
df1:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC
2 12 Three AB
3 13 Four AC
df2:
IID TEXT
0 10 aa
1 10 ab
2 11 abc
3 11 a,c
4 11 ab
5 12 AA
6 13 AC
7 13 ad
8 13 abc
df3 = df2.groupby('IID')['TEXT'].apply(lambda x: ','.join(x).upper()).reset_index()
df3:
IID TEXT
0 10 AA,AB
1 11 ABC,A,C,AB
2 12 AA
3 13 AC,AD,ABC
df4 = pd.merge(df1,df3,on='IID')
df4:
IID NAME TEXT_x TEXT_y
0 10 One AA,AB AA,AB
1 11 Two AB,AC ABC,A,C,AB
2 12 Three AB AA
3 13 Four AC AC,AD,ABC
df4['TEXT'] = df4[['TEXT_x','TEXT_y']].apply(
lambda x: ','.join(pd.unique(','.join(x).split(','))),
axis=1
)
df4 = df4.drop(['TEXT_x','TEXT_y'], axis=1)
Or, equivalently, building df5 from the merged df4 (before the drop):
df5 = df1.assign(TEXT = df4.apply(
lambda x: ','.join(pd.unique(','.join(x[['TEXT_x','TEXT_y']]).split(','))),
axis=1))
df4/df5:
IID NAME TEXT
0 10 One AA,AB
1 11 Two AB,AC,ABC,A,C
2 12 Three AB,AA
3 13 Four AC,AD,ABC

How do I compare a dataframe column with another dataframe and create a column

I have two dataframes, df1 and df2. Here is a small sample. My df1 is:
   Days
0     4
1     6
2     9
3     1
4     4
My df2 is
Day1 Day2 Alphabets
2 5 abc
4 7 bcd
8 10 ghi
10 12 abc
I want to change my df1 such that it has a new column Alphabets from df2 when the Days value in df1 is between Day1 and Day2. Something like:
if df1['Days'] in between df2['Day1'] and df2['Day2']:
df1['Alphabets']=df2['Alphabets']
Result is:
Days Alphabets
4 abc
6 bcd
9 ghi
etc.
I tried a for loop and it takes a long time to run. Is there another, more elegant way to do this?
Thanks in advance
I will use numpy broadcasting:
s1=df2.Day1.values
s2=df2.Day2.values
s=df1.Days.values[:,None]
df1['V']=((s-s1>0)&(s-s2<0)).dot(df2.Alphabets)
df1
Out[277]:
Days V
0 4 abc
1 6 bcd
2 9 ghi
3 1
4 4 abc
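For reference, a self-contained sketch of the broadcasting trick, with the sample frames rebuilt from the question: the comparison yields a boolean matrix of shape (len(df1), len(df2)), and its dot product with the string column concatenates the labels of every matching interval (an empty string when nothing matches).
import pandas as pd

df1 = pd.DataFrame({'Days': [4, 6, 9, 1, 4]})
df2 = pd.DataFrame({'Day1': [2, 4, 8, 10],
                    'Day2': [5, 7, 10, 12],
                    'Alphabets': ['abc', 'bcd', 'ghi', 'abc']})

s1 = df2['Day1'].values
s2 = df2['Day2'].values
s = df1['Days'].values[:, None]   # column vector, broadcasts against s1/s2

# (5, 4) boolean matrix: True where Days falls strictly between Day1 and Day2;
# dotting it with the strings picks out the matching label per row.
df1['V'] = ((s - s1 > 0) & (s - s2 < 0)).dot(df2['Alphabets'])
print(df1)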

Find occurrences of conditional value from one column and count values from another column in a dataframe

I have a dataframe containing userIds, week number, and a column X as shown below:
I am trying to find, per userId, the weeks where X is greater than 3, and keep the users for which this happens more than 3 weeks.
I have tried using groupby and lambda in pandas, but I am stuck:
weekly_X = df.groupby(['Userid','Week #'], as_index=False)
UserIds Week X
123 14 3
123 15 4
123 16 7
123 17 2
123 18 1
456 14 4
456 15 5
456 16 11
456 17 2
456 18 6
The result I am aiming for is a dataframe containing user 456 and how many weeks the condition occurred.
df_3 = df.groupby('UserIds').apply(lambda x: (x.X > 3).sum() > 3).to_frame('ID_want').reset_index()
df = df[df.UserIds.isin(df_3.loc[df_3.ID_want == 1,'UserIds'])]
Count the values greater than 3 per user with an aggregated sum, and then filter for counts greater than 3:
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print (out)
UserIds count
0 456 4
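For reference, a self-contained sketch of that approach, with the sample data rebuilt from the question (the Week column name is assumed from the table above):
import pandas as pd

df = pd.DataFrame({'UserIds': [123]*5 + [456]*5,
                   'Week': [14, 15, 16, 17, 18]*2,
                   'X': [3, 4, 7, 2, 1, 4, 5, 11, 2, 6]})

# Count the weeks with X > 3 per user, then keep users with more than 3 such weeks.
s = df['X'].gt(3).astype(int).groupby(df['UserIds']).sum()
out = s[s.gt(3)].reset_index(name='count')
print(out)   # user 456 qualifies, with 4 weeks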

dataframe transformation python

I am new to pandas. I have a dataframe, df, with 3 columns: (date), (name) and (count).
For each date: is there an easy way to create a new dataframe from the original one that contains new columns representing the unique names in the original name column, with their respective count values in the correct cells?
date name count
0 2017-08-07 ABC 12
1 2017-08-08 ABC 5
2 2017-08-08 TTT 6
3 2017-08-09 TAC 5
4 2017-08-09 ABC 10
It should now be
date ABC TTT TAC
0 2017-08-07 12 0 0
1 2017-08-08 5 6 0
3 2017-08-09 10 0 5
df = pd.DataFrame({"date":["2017-08-07","2017-08-08","2017-08-08","2017-08-09","2017-08-09"],"name":["ABC","ABC","TTT","TAC","ABC"], "count":["12","5","6","5","10"]})
df = df.pivot(index='date', columns='name', values='count').reset_index().fillna(0)
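One caveat: DataFrame.pivot raises a ValueError if any (date, name) pair occurs more than once. When duplicates are possible, pivot_table is the usual fallback; a sketch (summing duplicates is an assumption about the desired behaviour):
import pandas as pd

df = pd.DataFrame({"date": ["2017-08-07","2017-08-08","2017-08-08","2017-08-09","2017-08-09"],
                   "name": ["ABC","ABC","TTT","TAC","ABC"],
                   "count": [12, 5, 6, 5, 10]})

# pivot_table tolerates repeated (date, name) pairs by aggregating them.
out = df.pivot_table(index='date', columns='name', values='count',
                     aggfunc='sum', fill_value=0).reset_index()
print(out)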
