Grouping and merging with pandas - python-3.x

I need some help with this.
I need to transform this dataframe so there are no duplicates in the column "name". You can see that I have duplicates in column "name", e.g. John and Joan:
import pandas as pd

df0 = pd.DataFrame({'name': ['John', 'John', 'Joan', 'Joan', 'Juan'],
                    'time': [20, 10, 11, 18, 15],
                    'amount': [100, 400, 200, 100, 300]})
df0
name time amount
0 John 20 100
1 John 10 400
2 Joan 11 200
3 Joan 18 100
4 Juan 15 300
I need to transform this by grouping the dataframe in the following way; I don't know if this is the right approach.
dfend0 = df0.groupby('name').agg(lambda x: x.tolist())
dfend0
time amount
name
Joan [11, 18] [200, 100]
John [20, 10] [100, 400]
Juan [15] [300]
The column "name" is now the index, which isn't the behavior I was looking for:
list(dfend0.columns.values)
['time', 'amount']
#Now I need to merge with other dataframe
df1 = pd.DataFrame({
    'name': ['John', 'Joan', 'Juan'],
    'address': ['streetA', 'streetB', 'streetC'],
    'age': [30, 40, 50]
})
df1
name address age
0 John streetA 30
1 Joan streetB 40
2 Juan streetC 50
ender = df1.merge(df0)
ender
name address age time amount
0 John streetA 30 20 100
1 John streetA 30 10 400
2 Joan streetB 40 11 200
3 Joan streetB 40 18 100
4 Juan streetC 50 15 300
This is not what I'm looking for; this example would be more accurate:
name address age time amount
0 John streetA 30 20,10 100,400
1 Joan streetB 40 11,18 200,100
2 Juan streetC 50 15 300
Any clue?

First, use as_index=False if you don't want the name as the index after the groupby operation.
Second, there is no need for the lambda; use .agg(list):
dfend0 = df0.groupby('name', as_index=False).agg(list)
Then merge as usual:
df2 = pd.merge(df1, dfend0, on='name')
name address age time amount
0 John streetA 30 [20, 10] [100, 400]
1 Joan streetB 40 [11, 18] [200, 100]
2 Juan streetC 50 [15] [300]
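The groupby and merge can also be chained into a single expression; a minimal sketch using the same df0 and df1 as above:
# group, collect the values into lists, and merge in one chain
df2 = df1.merge(df0.groupby('name', as_index=False).agg(list), on='name')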
Note: if you don't want lists, use the following (not recommended, as you lose the underlying datatype and end up with strings):
df0end = df0.astype(str).groupby('name', as_index=False).agg(','.join)
name time amount
0 Joan 11,18 200,100
1 John 20,10 100,400
2 Juan 15 300
df2 = pd.merge(df1,df0end,on='name')
name address age time amount
0 John streetA 30 20,10 100,400
1 Joan streetB 40 11,18 200,100
2 Juan streetC 50 15 300
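If you later need the comma-joined strings back as numbers, a small sketch (assuming the merged df2 from above):
# split the joined strings back into lists of ints
df2['time'] = df2['time'].str.split(',').apply(lambda xs: [int(x) for x in xs])
df2['amount'] = df2['amount'].str.split(',').apply(lambda xs: [int(x) for x in xs])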

df = pd.DataFrame({'name': ['John', 'John', 'Joan', 'Joan', 'Juan'],
                   'time': [20, 10, 11, 18, 15],
                   'amount': [100, 400, 200, 100, 300]})
df = df.astype(str).groupby('name').agg({
    'time': lambda x: ','.join(x),
    'amount': lambda x: ','.join(x)
})
print(df)
time amount
name
Joan 11,18 200,100
John 20,10 100,400
Juan 15 300
At the end, reset the index so name becomes a column again, then merge with the question's df1: df = df.reset_index().merge(df1, on='name')
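Putting it together, a minimal end-to-end sketch (assuming the question's df0 and df1 are in scope):
grouped = df0.astype(str).groupby('name').agg({
    'time': ','.join,
    'amount': ','.join
})
result = grouped.reset_index().merge(df1, on='name')
# result has one row per name with the comma-joined time and
# amount strings plus that name's address and age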

Related

How to get the max/min value in a pandas dataframe

Hi, I have a dataframe like this:
import pandas as pd

data = [(1, "tom", 23),
        (1, "nick", 12),
        (1, "jim", 13),
        (2, "tom", 44),
        (2, "nick", 56),
        (2, "jim", 77),
        (3, "tom", 88),
        (3, "nick", 10),
        (3, "jim", 13),
        ]
df = pd.DataFrame(data, columns=['class', 'Name', 'number'])
Output of this dataframe:
class Name number
0 1 tom 23
1 1 nick 12
2 1 jim 13
3 2 tom 44
4 2 nick 56
5 2 jim 77
6 3 tom 88
7 3 nick 10
8 3 jim 13
How can I get the name with the maximum number in class 1, and then get that same name's numbers in the other classes? The result should look like this:
[name =tom, class=1, number =23]
[name =tom, class=2, number =44]
[name =tom, class=3, number =88]
Thank you very much for helping me!
Find the name first from class 1, and then filter:
name = df.Name.loc[df[df['class'] == 1].number.idxmax()]
df[df.Name == name]
# class Name number
#0 1 tom 23
#3 2 tom 44
#6 3 tom 88
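If you need the result in the bracketed record form shown in the question, a small follow-up sketch reusing the name found above:
# one dict per row, e.g. {'class': 1, 'Name': 'tom', 'number': 23}
records = df[df.Name == name].to_dict('records')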
Try this; note that it keeps the row with the maximum number per class, which is not restricted to a single name:
idx = df.groupby(['class'])['number'].transform('max') == df['number']
df[idx]

Finding the largest (N) proportion of percentage in pandas dataframe

Suppose I have the following df:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Paul', 'Jo', 'Zack', 'Chris', 'Mathew', 'Suzan'],
                   'visits': [0, 0, 1, 2, 3, 9, 6, 10, 3]})
df
looks like:
name visits
0 Sara 0
1 John 0
2 Christine 1
3 Paul 2
4 Jo 3
5 Zack 9
6 Chris 6
7 Mathew 10
8 Suzan 3
I did some lines of code to get the percentage of visits per name and sort them descending:
df['percent'] = df['visits'] / np.sum(df['visits'])
df = df.sort_values(by='percent', ascending=False).reset_index(drop=True)
Now I have got the percent of visits to total visits by all names:
name visits percent
0 Mathew 10 0.294118
1 Zack 9 0.264706
2 Chris 6 0.176471
3 Jo 3 0.088235
4 Suzan 3 0.088235
5 Paul 2 0.058824
6 Christine 1 0.029412
7 Sara 0 0.000000
8 John 0 0.000000
What I need to get is the group of names that accounts for the largest proportion of visits. For example, the first 3 rows represent ~73% of the total visits, and the others could be neglected compared to the sum of the first 3 rows' percentages.
I know I can select the top 3 by using nlargest:
df.nlargest(3, 'percent')
But there is high variability in the data, and the largest proportion could come from the first 2 or 3 rows, or even more.
EDIT:
How can I automatically find the largest N rows by proportion, when N is not known in advance?
You have to define outliers in some way. One way is to use scipy.stats.zscore like in this answer:
import pandas as pd
import numpy as np
from scipy import stats
df = pd.DataFrame({'name':['Sara', 'John', 'Christine','Paul', 'Jo', 'Zack','Chris', 'Mathew', 'Suzan'],
'visits': [0, 0, 1,2, 3, 9,6, 10, 3]})
df['percent'] = (df['visits'] / np.sum(df['visits']))
df.loc[df['percent'][stats.zscore(df['percent']) > 0.6].index]
which prints
name visits percent
5 Zack 9 0.264706
6 Chris 6 0.176471
7 Mathew 10 0.294118
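Another option, if "largest proportion" means the smallest set of top rows covering a fixed share of visits; a sketch where the 0.7 threshold is an assumption, not something from the question:
df_sorted = df.sort_values('percent', ascending=False).reset_index(drop=True)
# keep rows until the cumulative share first reaches the threshold;
# the shift ensures the row that crosses 0.7 is still included
top = df_sorted[df_sorted['percent'].cumsum().shift(fill_value=0) < 0.7]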

Adding a grouped column header to an existing dataframe

How can we add a column header to an existing Pandas dataframe on a supplementary row above two sub-column headers? The desired result is a CAPITAL header positioned above the currency sub-columns.
Here's the current code which adds the CAPITAL header, but does not position it correctly.
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
USD = 'USD'
CHF = 'CHF'
YIELD = 'YIELD AMT'
df = pd.DataFrame({
    OWNER: 2*['JOE'] + 3*['ROB'],
    USD: [10000, 30000, 4000, 24000, 16000],
    CHF: [9000, 27000, 3600, 21600, 14400],
    YIELD: [100, 300, 40, 240, 160]
})
print(df)
'''
OWNER USD CHF YIELD AMT
0 JOE 10000 9000 100
1 JOE 30000 27000 300
2 ROB 4000 3600 40
3 ROB 24000 21600 240
4 ROB 16000 14400 160
'''
df.columns = pd.MultiIndex.from_product([[CAPITAL], df.columns])
print('\nUsing pd.from_product()')
print(df)
'''
CAPITAL
OWNER USD CHF YIELD AMT
0 JOE 10000 9000 100
1 JOE 30000 27000 300
2 ROB 4000 3600 40
3 ROB 24000 21600 240
4 ROB 16000 14400 160
'''
The solution is to use pd.MultiIndex.from_arrays() instead of pd.MultiIndex.from_product(). Here's the code:
import pandas as pd
OWNER = 'OWNER'
CAPITAL = 'CAPITAL'
USD = 'USD'
CHF = 'CHF'
YIELD = 'YIELD AMT'
df_ok = pd.DataFrame({
    OWNER: 2*['JOE'] + 3*['ROB'],
    USD: [10000, 30000, 4000, 24000, 16000],
    CHF: [9000, 27000, 3600, 21600, 14400],
    YIELD: [100, 300, 40, 240, 160]
})
df_ok.columns = pd.MultiIndex.from_arrays([[' ', ' ', CAPITAL, ' '], df_ok.columns])
print('\nUsing pd.from_arrays()')
print()
print(df_ok)
'''
CAPITAL
OWNER USD CHF YIELD AMT
0 JOE 10000 9000 100
1 JOE 30000 27000 300
2 ROB 4000 3600 40
3 ROB 24000 21600 240
4 ROB 16000 14400 160
'''
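If the CAPITAL header should span both currency columns rather than sit above a single one, a hedged variant using pd.MultiIndex.from_tuples (same constants as above):
df_ok.columns = pd.MultiIndex.from_tuples([
    ('', OWNER),        # no group header above OWNER
    (CAPITAL, USD),     # CAPITAL groups the two currency columns
    (CAPITAL, CHF),
    ('', YIELD)         # no group header above YIELD AMT
])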

how to categorize salary into high/med/low group in python?

I have an employee dataset with salary details. I'd like to add an additional column to display each employee's salary group, like high/med/low:
Data:
Empno Sal Deptno
1 800 20
2 1600 30
3 2975 20
4 1250 30
5 2850 30
6 2450 10
7 3000 20
Expected Output:
Empno Sal Deptno Sal_Group
1 800 20 low
2 1600 30 mid
3 2975 20 ...
4 1250 30 ...
5 2850 30 ...
6 2450 10 ...
7 3000 20 high
You can try this:
import pandas as pd
import numpy as np
df = pd.read_csv("file.csv")
bins = np.linspace(min(df['Sal']), max(df['Sal']), 4)
groupNames = ["low", "med", "high"]
df['SalGroup'] = pd.cut(df['Sal'], bins, labels=groupNames, include_lowest=True)
print(df)
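Note that np.linspace gives equal-width bins over the salary range, so the groups can be very unevenly sized. If you want roughly equal-sized groups instead, pd.qcut is a possible alternative (a sketch with the same column names):
# three quantile-based groups: roughly the same number of employees in each
df['SalGroup'] = pd.qcut(df['Sal'], q=3, labels=["low", "med", "high"])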

Concatenate 2 dfs by a condition

I have 2 dfs
import pandas as pd
list_columns = ['Number', 'Name', 'Age']
list_data = [
    [121, 'John', 25],
    [122, 'Sam', 26]
]
df1 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 25
1 122 Sam 26
and
list_columns = ['Number', 'Name', 'Age']
list_data = [
    [121, 'John', 31],
    [122, 'Sam', 29],
    [123, 'Andrew', 28]
]
df2 = pd.DataFrame(columns=list_columns, data=list_data)
Number Name Age
0 121 John 31
1 122 Sam 29
2 123 Andrew 28
In the end I want to take the missing rows from df2 and add them to df1, based on the column Number.
In the above case df1 is missing only Number 123, and I want to move only the data from that line into df1, so it will look like:
|Number|Name | Age|
| 121 |John | 25 |
| 122 |Sam | 26 |
| 123 |Andrew| 28 |
I tried to use concat with keep='first', but I am afraid that with a lot of data it will alter the existing data in df1 (I want to add only the missing data, based on Number).
Is there a better way of achieving this?
This is how I tried to concat:
pd.concat([df1,df2]).drop_duplicates(['Number'],keep='first')
Use DataFrame.set_index on df1 and df2 to set the index as column Number and use DataFrame.combine_first:
df = (
df1.set_index('Number').combine_first(
df2.set_index('Number')).reset_index()
)
Result:
Number Name Age
0 121 John 25.0
1 122 Sam 26.0
2 123 Andrew 28.0
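Note that combine_first upcasts Age to float where both frames supply a value. If the goal is strictly to append the Numbers missing from df1 while leaving df1's rows and dtypes untouched, a sketch closer to the concat attempt:
# keep only df2 rows whose Number is not already in df1, then append them
missing = df2[~df2['Number'].isin(df1['Number'])]
result = pd.concat([df1, missing], ignore_index=True)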
