How to get ranges of one column grouped by a class column in Pandas? - python-3.x

I'm practicing with Pandas and I want to get the ranges of one column of a dataframe grouped by the values of another column.
An example dataset:
Points Grade
1 7.5 C
2 9.3 A
3 NaN A
4 1.3 F
5 8.7 B
6 9.5 A
7 7.9 C
8 4.5 F
9 8.0 B
10 6.8 D
11 5.0 D
I want to group the ranges of points for each grade so I can fill in the missing values.
To do that I need to get something like this:
Grade Points
A [9.5, 9.3]
B [8.7, 8.0]
C [7.5, 7.9]
D [6.8, 5.0]
F [1.3, 4.5]
I can get it with a for loop and that kind of thing, but is it possible with pandas in some easy way?
I tried all the groupby combinations I know and got nothing. Any suggestions?
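For reference, a minimal sketch to reconstruct the sample frame above (column order and the NaN position are taken from the table; pandas' default 0-based index is used here):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Points': [7.5, 9.3, np.nan, 1.3, 8.7, 9.5, 7.9, 4.5, 8.0, 6.8, 5.0],
                   'Grade': list('CAAFBACFBDD')})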

You can first filter df with notnull, then groupby and convert each group to a list with tolist, and finally reset_index:
print(df)
Points Grade
0 7.5 C
1 9.3 A
2 NaN A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df['Points'].notnull())
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
Name: Points, dtype: bool
print(df.loc[df['Points'].notnull()])
Points Grade
0 7.5 C
1 9.3 A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df.loc[df['Points'].notnull()]
        .groupby('Grade')['Points']
        .apply(lambda x: x.tolist())
        .reset_index())
Grade Points
0 A [9.3, 9.5]
1 B [8.7, 8.0]
2 C [7.5, 7.9]
3 D [6.8, 5.0]
4 F [1.3, 4.5]
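In more recent pandas versions the same result can be written a bit more compactly by dropping the NaN rows with dropna and aggregating straight to lists with agg (a sketch of the same idea, not from the original answer):
print(df.dropna(subset=['Points'])
        .groupby('Grade')['Points']
        .agg(list)
        .reset_index())
This gives the same Grade/Points pairs as the apply(lambda x: x.tolist()) version above.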

Related

Groupby count of non NaN of another column and a specific calculation of the same columns in pandas

I have a data frame as shown below
ID Class Score1 Score2 Name
1 A 9 7 Xavi
2 B 7 8 Alba
3 A 10 8 Messi
4 A 8 10 Neymar
5 A 7 8 Mbappe
6 C 4 6 Silva
7 C 3 2 Pique
8 B 5 7 Ramos
9 B 6 7 Serge
10 C 8 5 Ayala
11 A NaN 4 Casilas
12 A NaN 4 De_Gea
13 B NaN 2 Seaman
14 C NaN 7 Chilavert
15 B NaN 3 Courtous
From the above, I would like to calculate the number of players with Score1 less than or equal to 6 in each Class, along with the count of non-NaN rows (class-wise).
Expected output:
Class Total_Number Count_Non_NaN Score1_less_than_6_# Avg_score1
A 6 4 0 8.5
B 5 3 2 6
C 4 3 2 5
I tried the code below:
df2 = df.groupby('Class').agg(Total_Number=('Score1', 'size'),
                              Score1_less_than_6=('Score1', lambda x: x.between(0, 6).sum()),
                              Avg_score1=('Score1', 'mean'))
df2 = df2.reset_index()
df2
Groupby and aggregate using a dictionary:
df['s'] = df['Score1'].le(6)
df.groupby('Class').agg(**{'total_number': ('Score1', 'size'),
                           'count_non_nan': ('Score1', 'count'),
                           'score1_less_than_six': ('s', 'sum'),
                           'avg_score1': ('Score1', 'mean')})
total_number count_non_nan score1_less_than_six avg_score1
Class
A 6 4 0 8.5
B 5 3 2 6.0
C 4 3 2 5.0
Try:
x = df.groupby("Class", as_index=False).agg(
    Total_Number=("Class", "count"),
    Count_Non_NaN=("Score1", lambda x: x.notna().sum()),
    Score1_less_than_6=("Score1", lambda x: (x <= 6).sum()),
    Avg_score1=("Score1", "mean"),
)
print(x)
Prints:
Class Total_Number Count_Non_NaN Score1_less_than_6 Avg_score1
0 A 6 4.0 0.0 8.5
1 B 5 3.0 2.0 6.0
2 C 4 3.0 2.0 5.0
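A variant that combines the two answers without mutating df is to add the boolean flag with assign; the le6 column name below is just an illustrative helper, not part of either answer:
out = (df.assign(le6=df['Score1'].le(6))
         .groupby('Class')
         .agg(Total_Number=('Score1', 'size'),
              Count_Non_NaN=('Score1', 'count'),
              Score1_less_than_6=('le6', 'sum'),
              Avg_score1=('Score1', 'mean'))
         .reset_index())
print(out)
Because comparisons against NaN evaluate to False, the missing scores are not counted as less than or equal to 6, which matches the expected output above.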

Rearrange columns in DataFrame

Having a DataFrame structured as follows:
country A B C D
0 Albany 5.2 4.7 253.75 4
1 China 7.5 3.4 280.72 3
2 Portugal 4.6 7.5 320.00 6
3 France 8.4 3.6 144.00 3
4 Greece 2.1 10.0 331.00 6
I wanted to get something like this:
cost A B
country C D C D
Albany 2.05 4 1.85 4
China 2.67 3 1.21 3
Portugal 1.44 6 2.34 6
France 5.83 3 2.50 3
Greece 0.63 6 3.02 6
I mean: get columns A and B as headers over C and D, keep D the same with its original value, and replace C with the header column as a percentage of the original C. Example for Albany:
value C in A: (5.2/253.75)*100 = 2.05
value C in B: (4.7/253.75)*100 = 1.85
Is there any way to do it?
Thanks!
You can divide multiple columns (here A and B) by column C with DataFrame.div, then reindex by a MultiIndex created with MultiIndex.from_product, and finally set the D columns from the original DataFrame with MultiIndex slicers:
cols = ['A','B']
mux = pd.MultiIndex.from_product([cols, ['C', 'D']])
df1 = df[cols].div(df['C'], axis=0).mul(100).reindex(mux, axis=1, level=0)
idx = pd.IndexSlice
df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].to_numpy()
#pandas below 0.24
#df1.loc[:, idx[:, 'D']] = df[['D'] * len(cols)].values
print (df1)
A B
C D C D
0 2.049261 4 1.852217 4
1 2.671701 3 1.211171 3
2 1.437500 6 2.343750 6
3 5.833333 3 2.500000 3
4 0.634441 6 3.021148 6
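An alternative sketch, assuming the same df: build one C/D block per header column with a dict comprehension and let pd.concat turn the keys into the first column level:
out = pd.concat(
    {c: df[['C', 'D']].assign(C=df[c].div(df['C']).mul(100)) for c in ['A', 'B']},
    axis=1,
)
print(out)
Each key ('A', 'B') becomes the top level, D keeps its original values, and C is recomputed per block.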

How to replace the missing values with average of ffill() and bfill() in pandas?

This is a sample dataframe and it contains NA values:
x y z datetime
0 2 3 4 02-02-2019
1 NA NA NA 03-02-2019
2 3 5 7 04-02-2019
3 NA NA NA 05-02-2019
4 4 7 9 06-02-2019
Now, I want to fill these NA values, and I can do that using either ffill() or bfill(). But what if I want to apply the average of ffill() and bfill()? How can I do that?
The direct average df = (df.ffill() + df.bfill()) / 2 didn't work because of the datetime column.
The end dataframe should look like this:
x y z datetime
0 2 3 4 02-02-2019
1 2.5 4 5.5 03-02-2019
2 3 5 7 04-02-2019
3 3.5 6 8 05-02-2019
4 4 7 9 06-02-2019
Check with df.interpolate:
df.interpolate()
x y z datetime
0 2.0 3.0 4.0 02-02-2019
1 2.5 4.0 5.5 03-02-2019
2 3.0 5.0 7.0 04-02-2019
3 3.5 6.0 8.0 05-02-2019
4 4.0 7.0 9.0 06-02-2019
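If you specifically want the (ffill + bfill) / 2 from the question, a sketch that sidesteps the datetime column is to restrict the arithmetic to the numeric columns (this assumes the NA entries are real NaN values, not the literal string 'NA'):
num_cols = df.select_dtypes('number').columns
df[num_cols] = (df[num_cols].ffill() + df[num_cols].bfill()) / 2
For single-row gaps this matches interpolate(); for longer gaps the two differ, since interpolate spreads values linearly while the average assigns the same midpoint to every row in the gap.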

Working with two data frames with different size in python

I am working with two data frames.
The sample data is as follow:
DF = ['A','B','C','D','E','A','C','B','B']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
I want to add a new column to DF1 as follows:
Team Rating
A 1
B 2
C 3
D 4
E 5
A 1
C 3
B 2
B 2
How can I add a new column?
I used
DF1['Rating'] = np.where(DF1['Team'] == DF2['Team'], DF2['Rating'], 0)
which raises: ValueError: Can only compare identically-labeled Series objects
Thanks
I think you need map with a Series created by set_index; values with no match become NaN, so fillna is added to replace them with 0:
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
DF = ['A','B','C','D','E','A','C','B','B', 'G']
DF1 = pd.DataFrame({'Team':DF})
DF2 = pd.DataFrame({'Team':['A','B','C','D','E'],'Rating':[1,2,3,4,5]})
DF1['Rating']= DF1['Team'].map(DF2.set_index('Team')['Rating']).fillna(0)
print (DF1)
Team Rating
0 A 1.0
1 B 2.0
2 C 3.0
3 D 4.0
4 E 5.0
5 A 1.0
6 C 3.0
7 B 2.0
8 B 2.0
9 G 0.0 <- G not in DF2['Team']
Detail:
print (DF1['Team'].map(DF2.set_index('Team')['Rating']))
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 1.0
6 3.0
7 2.0
8 2.0
9 NaN
Name: Team, dtype: float64
You can use:
In [54]: DF1['new_col'] = DF1.Team.map(DF2.set_index('Team').Rating)
In [55]: DF1
Out[55]:
Team new_col
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
I think you can use pd.merge:
DF1=pd.merge(DF1,DF2,how='left',on='Team')
DF1
Team Rating
0 A 1
1 B 2
2 C 3
3 D 4
4 E 5
5 A 1
6 C 3
7 B 2
8 B 2
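One difference worth noting between the answers: the plain left merge leaves NaN for teams missing from DF2, while the map answer fills them with 0. A sketch that makes the merge behave the same way:
DF1 = pd.merge(DF1, DF2, how='left', on='Team')
DF1['Rating'] = DF1['Rating'].fillna(0)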

Multiple columns difference of 2 Pandas DataFrame

I am new to Python and Pandas. Can someone help me with the report below?
I want to report the difference of N columns and create new columns holding the difference values. Is it possible to make this dynamic, as I have more than 30 columns? (The set of columns is fixed; row values can change.)
A and B can be alphanumeric.
Use join with sub for difference of DataFrames:
#if columns are strings, cast them first
df1 = df1.astype(int)
df2 = df2.astype(int)
#if first columns are not indices
#df1 = df1.set_index('ID')
#df2 = df2.set_index('ID')
df = df1.join(df2.sub(df1).add_prefix('sum'))
print (df)
A B sumA sumB
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Or similar:
df = df1.join(df2.sub(df1), rsuffix='sum')
print (df)
A B Asum Bsum
ID
0 10 2.0 5 3.0
1 11 3.0 6 5.0
2 12 4.0 7 5.0
Detail:
print (df2.sub(df1))
A B
ID
0 5 3.0
1 6 5.0
2 7 5.0
IIUC
df1[['C','D']]=(df2-df1)[['A','B']]
df1
Out[868]:
ID A B C D
0 0 10 2.0 5 3.0
1 1 11 3.0 6 5.0
2 2 12 4.0 7 5.0
df1.assign(B=0)
Out[869]:
ID A B C D
0 0 10 0 5 3.0
1 1 11 0 6 5.0
2 2 12 0 7 5.0
The 'ID' column should really be an index. See the Pandas tutorial on indexing for why this is a good idea.
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df = df1.copy()
df[['C', 'D']] = df2 - df1
df['B'] = 0
print(df)
outputs
A B C D
ID
0 10 0 5 3.0
1 11 0 6 5.0
2 12 0 7 5.0
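Since the question mentions 30+ columns, a sketch of a dynamic version is to take the difference of all shared columns at once and name the new columns programmatically (the _diff suffix is just an illustrative naming choice):
df = df1.join(df2.sub(df1).add_suffix('_diff'))
print(df)
This scales to any number of columns as long as df1 and df2 share the same index and column names.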
