Groupby count of non NaN of another column and a specific calculation of the same columns in pandas - python-3.x

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with scoer1 less than or equal to 6 in each Class along with count of non NaN rows (Class wise)
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
tried below code
df2 = df.groupby('Class').agg(Total_Number = ('Score1','size'),
Score1_less_than_6 = ('Score1',lambda x: x.between(0,6).sum()),
Avg_score1 = ('Score1','mean'))
df2 = df2.reset_index()
df2

Groupby and aggregate using a dictionary
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
'count_non_nan': ('Score1', 'count'),
'score1_less_than_six': ('s', 'sum'),
'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0

Try:
x = df.groupby("Class", as_index=False).agg(
Total_Number=("Class", "count"),
Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0

Related

Split value of a row with equal values based on the unique entry

I have a dataframe which looks like this:
Name T1 T2 alpha
A 10 3 30
A 11 5 Nan
A 13 5 Nan
B 5 2 7
B 3 1 Nan
However need to divide the alpha column in equal parts to replace Nan values for each unique names in Name column like this: eg 30 for A becomes 10 each for corresponding row where A is present
Name T1 T2 alpha
A 10 3 10
A 11 5 10
A 13 5 10
B 5 2 3.5
B 3 1 3.5
I tried using explode but it is not working as I want it to look like, any idea here would help
With groupby and transform
df['alpha'] = pd.to_numeric(df['alpha'],errors = 'coerce').fillna(0).groupby(df['Name']).transform('mean')
df
Out[50]:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5
This'll get the job done:
df['alpha'] = (df['alpha'] / df['T2']).ffill()
Output:
Name T1 T2 alpha
0 A 10 3 10.0
1 A 11 5 10.0
2 A 13 5 10.0
3 B 5 2 3.5
4 B 3 1 3.5

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two table so that the margin will only be joined to the product with maximum sequence number, the desired outcome is
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How to achieve this?
Below is the code for margin and item table
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
.transform('max').ne(new_df['sequence']), 'margin'] = np.NAN
Another option would be to assign a temp column to both frames df_item with True where the value is maximal, and df_margin is True everywhere then merge outer and drop the temp column:
new_df = (
df_item.assign(
t=df_item
.groupby('item')['sequence']
.transform('max')
.eq(df_item['sequence'])
).merge(df_margin.assign(t=True), how='outer').drop('t', 1)
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max().\
reset_index().merge(df_margin), 'left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')

How to replace the missing values with average of ffill() and bfill() in pandas?

This is a sample dataframe and it containsNA:
x y z datetime
0 2 3 4 02-02-2019
1 NA NA NA 03-02-2019
2 3 5 7 04-02-2019
3 NA NA NA 05-02-2019
4 4 7 9 06-02-2019
Now, i want to fill these NA values and i can do this by using either ffill() or bfill(). But what if want to apply the average of the ffill() & bfill(). Then how can i do this?
The direct average df = (df.fill() + df.bfill()) / 2 didn't work because of datetime column.
The end dataframe should look like this:
x y z datetime
0 2 3 4 02-02-2019
1 2.5 4 5.5 03-02-2019
2 3 5 7 04-02-2019
3 3.5 6 8 05-02-2019
4 4 7 9 06-02-2019
Check with df.interpolate:
df.interpolate()
x y z datetime
0 2.0 3.0 4.0 02-02-2019
1 2.5 4.0 5.5 03-02-2019
2 3.0 5.0 7.0 04-02-2019
3 3.5 6.0 8.0 05-02-2019
4 4.0 7.0 9.0 06-02-2019

Working with two data frames with different size in python

I am working with two data frames.
The sample data is as follow:
DF = ['A','B','C','D','E','A','C','B','B']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
i want to add a new column to DF1 as follow:
Team Rating
A 1
B 2
C 3
D 4
E 5
A 1
C 3
B 2
B 2
How can I add a new column?
I used
DF1['Rating']= np.where(DF1['Team']== DF2['Team'],DF2['Rating'],0)
Error : ValueError: Can only compare identically-labeled Series objects
Thanks
ZEP
I think need map by Series created with set_index and if not match get NaNs, so fillna was added for replace to 0:
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
DF = ['A','B','C','D','E','A','C','B','B', 'G']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 A 1.0
6 C 3.0
7 B 2.0
8 B 2.0
9 G 0.0 <- G not in DF2['Team']
Detail:
print (DF1['Team'].map(DF2.set_index('Team')['Rating']))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 1.0
6 3.0
7 2.0
8 2.0
9 NaN
Name: Team, dtype: float64
You can use:
In [54]: DF1['new_col'] = DF1.Team.map(DF2.set_index('Team').Rating)
In [55]: DF1
Out[55]:
Team new_col
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
i think you can use pd.merge
DF1=pd.merge(DF1,DF2,how='left',on='Team')
DF1
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2

pandas moving aggregate string

from pandas import *
import StringIO
df = read_csv(StringIO.StringIO('''id months state
1 1 C
1 2 3
1 3 6
1 4 9
2 1 C
2 2 C
2 3 3
2 4 6
2 5 9
2 6 9
2 7 9
2 8 C
'''), delimiter= '\t')
I want to create a column show the cumulative state of column state, by id.
id months state result
1 1 C C
1 2 3 C3
1 3 6 C36
1 4 9 C369
2 1 C C
2 2 C CC
2 3 3 CC3
2 4 6 CC36
2 5 9 CC69
2 6 9 CC699
2 7 9 CC6999
2 8 C CC6999C
Basically the cum concatenation of string columns. What is the best way to do it?
So long as the dtype is str then you can do the following:
In [17]:
df['result']=df.groupby('id')['state'].apply(lambda x: x.cumsum())
df
Out[17]:
id months state result
0 1 1 C C
1 1 2 3 C3
2 1 3 6 C36
3 1 4 9 C369
4 2 1 C C
5 2 2 C CC
6 2 3 3 CC3
7 2 4 6 CC36
8 2 5 9 CC369
9 2 6 9 CC3699
10 2 7 9 CC36999
11 2 8 C CC36999C
Essentially we groupby on 'id' column and then apply a lambda with a transform to return the cumsum. This will perform a cumulative concatenation of the string values and return a Series with it's index aligned to the original df so you can add it as a column

Resources